On 27/06/14 11:31, Reuti wrote:
Hi,
On 26.06.2014, at 17:56, Txema Heredia wrote:
<snip>
# qstat -j 4561291 -cb | grep "job_name\|binding\|queue_list"
job_name: c0-1
hard_queue_list: *@compute-0-1.local
binding: set linear:1:0,0
binding 1: NONE
What am I missing here? What could be different on my nodes?
Does `qhost -F` output the fields:
$ qhost -F
...
hl:m_topology=SC
hl:m_topology_inuse=SC
hl:m_socket=1.000000
hl:m_core=1.000000
for this machine?
-- Reuti
Yes, qhost -F reports that for all nodes:
# qhost -F | grep "compute\|hl:m_"
compute-0-0 lx26-amd64 12 0.60 94.6G 10.1G 9.8G 53.8M
hl:m_topology=SCCCCCCSCCCCCC
hl:m_topology_inuse=SCCCCCCSCCCCCC
hl:m_socket=2.000000
hl:m_core=12.000000
compute-0-1 lx26-amd64 12 7.21 94.6G 14.9G 9.8G 86.6M
hl:m_topology=SCCCCCCSCCCCCC
hl:m_topology_inuse=ScCCCCCSCCCCCC
hl:m_socket=2.000000
hl:m_core=12.000000
...
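(For a less noisy check, qhost -F also accepts a comma-separated list of
complexes, so something like "qhost -F m_topology,m_topology_inuse,m_socket,m_core"
should print just these fields per host.)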
But the inuse topology is blatantly wrong. Below is a combination of
"qhost -F" and "qstat -cb -j" for every job on every node (incoming wall
of text):
# for i in $(seq 0 1); do for j in $(seq 0 11); do
    comp="compute-${i}-${j}"; qhost -F -h ${comp} | grep "${comp}\|inuse";
    for id in $(qstat -u '*' -s r -q all.q@${comp} | grep ${comp} | awk '{print $1}'); do
      qstat -cb -j ${id} | grep "binding";
    done; done; done
compute-0-0 lx26-amd64 12 0.40 94.6G 10.1G 9.8G 53.8M
hl:m_topology_inuse=SCCCCCCSCCCCCC
compute-0-1 lx26-amd64 12 7.12 94.6G 14.0G 9.8G 86.6M
hl:m_topology_inuse=ScCCCCCSCCCCCC
binding: set linear:1
binding 1: ScCCCCCSCCCCCC
binding: set linear:6:0,0
binding 1: NONE
compute-0-2 lx26-amd64 12 7.21 94.6G 18.8G 9.8G 49.4M
hl:m_topology_inuse=SCcCCCCSCCCCCC
binding: set linear:1
binding 1: SCcCCCCSCCCCCC
binding: set linear:6:0,0
binding 1: NONE
compute-0-3 lx26-amd64 12 7.08 94.6G 13.6G 9.8G 128.5M
hl:m_topology_inuse=ScCCCCCSCCCCCC
binding: set linear:6:0,0
binding 1: NONE
binding: set linear:1:0,0
binding 1: ScCCCCCSCCCCCC
compute-0-4 lx26-amd64 12 6.06 94.6G 12.4G 9.8G 79.5M
hl:m_topology_inuse=SCCCCCCSCCCCCC
binding: set linear:6:0,0
binding 1: NONE
compute-0-5 lx26-amd64 12 7.11 94.6G 31.6G 9.8G 92.4M
hl:m_topology_inuse=SccccccSCcCCCC
binding: set linear:1
binding 1: SCCCCCCSCcCCCC
binding: set linear:6:0,0
binding 1: SccccccSCCCCCC
compute-0-6 lx26-amd64 12 6.05 94.6G 15.2G 9.8G 48.3M
hl:m_topology_inuse=SCCCCCCSCCCCCC
binding: set linear:6:0,0
binding 1: NONE
compute-0-7 lx26-amd64 12 6.09 94.6G 32.0G 9.8G 96.7M
hl:m_topology_inuse=SccccccSCCCCCC
binding: set linear:6:0,0
binding 1: SccccccSCCCCCC
compute-0-8 lx26-amd64 12 6.19 94.6G 31.3G 9.8G 101.1M
hl:m_topology_inuse=SccccccSCCCCCC
binding: set linear:6:0,0
binding 1: SccccccSCCCCCC
compute-0-9 lx26-amd64 12 6.11 94.6G 12.4G 9.8G 115.9M
hl:m_topology_inuse=SCCCCCCSCCCCCC
binding: set linear:6:0,0
binding 1: NONE
compute-0-10 lx26-amd64 12 6.16 94.6G 15.4G 9.8G 85.7M
hl:m_topology_inuse=SCCCCCCSCCCCCC
binding: set linear:6:0,0
binding 1: NONE
compute-0-11 lx26-amd64 12 7.11 94.6G 13.3G 9.8G 60.3M
hl:m_topology_inuse=SCCCCCCScCCCCC
binding: set linear:1
binding 1: SCCCCCCScCCCCC
binding: set linear:6:0,0
binding 1: NONE
compute-1-0 lx26-amd64 12 2.57 94.6G 10.6G 9.8G 53.3M
hl:m_topology_inuse=SccCCCCSCCCCCC
binding: set linear:1
binding 1: SCcCCCCSCCCCCC
binding: set linear:1:0,0
binding 1: ScCCCCCSCCCCCC
compute-1-1 lx26-amd64 12 1.23 94.6G 10.2G 9.8G 92.8M
hl:m_topology_inuse=SCCCCCCScCCCCC
binding: set linear:1
binding 1: SCCCCCCScCCCCC
compute-1-2 lx26-amd64 12 0.35 94.6G 10.3G 9.8G 40.7M
hl:m_topology_inuse=SCCCCCCSCCCCCC
compute-1-3 lx26-amd64 12 1.70 94.6G 10.2G 9.8G 44.8M
hl:m_topology_inuse=SCCCCCCSCCCCCC
binding: set linear:1:0,0
binding 1: NONE
binding: set linear:4:0,0
binding 1: NONE
compute-1-4 lx26-amd64 12 6.15 94.6G 14.2G 9.8G 58.7M
hl:m_topology_inuse=SCCCCCCSCCCCCC
binding: set linear:6:0,0
binding 1: NONE
compute-1-5 lx26-amd64 12 7.07 94.6G 33.9G 9.8G 46.1M
hl:m_topology_inuse=SccccccSCCCCCC
binding: set linear:6:0,0
binding 1: SccccccSCCCCCC
binding: set linear:1:0,0
binding 1: NONE
compute-1-6 lx26-amd64 12 7.13 94.6G 31.5G 9.8G 78.1M
hl:m_topology_inuse=SccccccSCCCCCC
binding: set linear:6:0,0
binding 1: SccccccSCCCCCC
binding: set linear:1:0,0
binding 1: NONE
compute-1-7 lx26-amd64 12 6.14 94.6G 16.3G 9.8G 40.2M
hl:m_topology_inuse=SCCCCCCSCCCCCC
binding: set linear:6:0,0
binding 1: NONE
compute-1-8 lx26-amd64 12 7.10 94.6G 32.2G 9.8G 39.4M
hl:m_topology_inuse=SccccccSCCCCCC
binding: set linear:6:0,0
binding 1: SccccccSCCCCCC
binding: set linear:1:0,0
binding 1: NONE
compute-1-9 lx26-amd64 12 0.45 94.6G 10.1G 9.8G 39.0M
hl:m_topology_inuse=SCCCCCCSCCCCCC
compute-1-10 lx26-amd64 12 0.40 94.6G 10.5G 9.8G 191.2M
hl:m_topology_inuse=SCCCCCCSCCCCCC
compute-1-11 lx26-amd64 12 0.02 94.6G 10.6G 9.8G 34.4M
hl:m_topology_inuse=SCCCCCCSCCCCCC
As you can see, some nodes report everything properly, some report only
part of it, and some report nothing. There is also no coherence among
jobs: all the 6-core jobs are identical, submitted at once by the same
user, yet some make use of the binding and some don't. I have even
witnessed three 4-core jobs running on the same node, where one showed
the proper binding and the other two had NONE (I couldn't capture them).
I have just realized that, for a few days now, I have had both a JSV and
a wrapper script modifying the binding:
JSV:
# Inside the jsv_on_verify handler: bind as many cores as the maximum
# of the requested PE slot range (pe_max); serial jobs get a single core.
my $binding = 1;
if (jsv_is_param('pe_name') && jsv_is_param('pe_max')) {
    $binding = jsv_get_param('pe_max');
}
jsv_set_param('binding_type', 'set');
jsv_set_param('binding_strategy', 'linear');
jsv_set_param('binding_amount', $binding);
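For context, that fragment lives inside the usual Perl JSV skeleton,
roughly like this (a sketch assuming the stock JSV.pm shipped under
$SGE_ROOT/util/resources/jsv; the jsv_correct() message is made up):

#!/usr/bin/perl
use strict;
use warnings;
use lib "$ENV{SGE_ROOT}/util/resources/jsv";
use JSV qw( :DEFAULT );

jsv_on_verify(sub {
    # ... the binding logic shown above ...
    # Accept the job with the corrected binding_* parameters.
    jsv_correct('binding adjusted by JSV');
    return;
});

jsv_main();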
wrapper (users run this script instead of qsub; it rewrites the qsub
command line and then calls qsub itself, all before the JSV runs):
if (!$pe) {
    $binding = " -binding linear:1 ";
} else {
    $binding = " -binding linear:$pe_slots_num ";
}
...
qsub_orig ... $binding $original_params
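So for a 6-slot PE job the wrapper effectively runs the real qsub with
"-binding linear:6" injected, e.g. (hypothetical PE name and script):
qsub -binding linear:6 -pe smp 6 job.sh
and the JSV then rewrites the binding_* parameters on top of that.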
But, for instance, all of the currently running 6-core jobs were
submitted while both scripts were in place, yet each job behaves
differently.
What is happening here?
PS: I have just disabled the binding logic in the wrapper. I'll check
whether everything works fine now.