On 27.06.2014 at 12:24, Txema Heredia wrote:

> On 27/06/14 11:31, Reuti wrote:
>> Hi,
>> 
>> On 26.06.2014 at 17:56, Txema Heredia wrote:
>> 
>>> <snip>
>>> 
>>> # qstat -j 4561291 -cb | grep "job_name\|binding\|queue_list"
>>> job_name:                   c0-1
>>> hard_queue_list:            *@compute-0-1.local
>>> binding:                    set linear:1:0,0
>>> binding    1:               NONE
>>> 
>>> What am I missing here? What could be different on my nodes?
>> Does `qhost -F` output the fields:
>> 
>> $ qhost -F
>> ...
>>    hl:m_topology=SC
>>    hl:m_topology_inuse=SC
>>    hl:m_socket=1.000000
>>    hl:m_core=1.000000
>> 
>> for this machine?
>> 
>> -- Reuti
> Yes, qhost -F reports that for all nodes:
> 
> # qhost -F | grep "compute\|hl:m_"
> compute-0-0             lx26-amd64     12  0.60   94.6G   10.1G 9.8G   53.8M
>   hl:m_topology=SCCCCCCSCCCCCC
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
>   hl:m_socket=2.000000
>   hl:m_core=12.000000
> compute-0-1             lx26-amd64     12  7.21   94.6G   14.9G 9.8G   86.6M
>   hl:m_topology=SCCCCCCSCCCCCC
>   hl:m_topology_inuse=ScCCCCCSCCCCCC
>   hl:m_socket=2.000000
>   hl:m_core=12.000000
> ...
> 
> 
> But the inuse topology is blatantly wrong.

Which version of SGE are you using? Perhaps "PLPA", which older versions 
used, doesn't support this particular CPU's topology. It was later replaced 
by "hwloc".

-- Reuti


> This is a combination of "qhost -F" + "qstat -cb -j" for all jobs on all 
> nodes (incoming wall of text):
> 
> 
> # for i in $(seq 0 1); do for j in $(seq 0 11); do comp="compute-${i}-${j}"; 
> qhost -F -h ${comp} | grep "${comp}\|inuse" ; for id in $(qstat -u *, -s r -q 
> all.q@${comp} | grep ${comp} | awk '{print $1}'); do qstat -cb -j ${id} | 
> grep "binding" ; done; done; done
> 
> compute-0-0             lx26-amd64     12  0.40   94.6G   10.1G 9.8G   53.8M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> 
> compute-0-1             lx26-amd64     12  7.12   94.6G   14.0G 9.8G   86.6M
>   hl:m_topology_inuse=ScCCCCCSCCCCCC
> binding:                    set linear:1
> binding    1:               ScCCCCCSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               NONE
> 
> compute-0-2             lx26-amd64     12  7.21   94.6G   18.8G 9.8G   49.4M
>   hl:m_topology_inuse=SCcCCCCSCCCCCC
> binding:                    set linear:1
> binding    1:               SCcCCCCSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               NONE
> 
> compute-0-3             lx26-amd64     12  7.08   94.6G   13.6G 9.8G  128.5M
>   hl:m_topology_inuse=ScCCCCCSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               NONE
> binding:                    set linear:1:0,0
> binding    1:               ScCCCCCSCCCCCC
> 
> compute-0-4             lx26-amd64     12  6.06   94.6G   12.4G 9.8G   79.5M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               NONE
> 
> compute-0-5             lx26-amd64     12  7.11   94.6G   31.6G 9.8G   92.4M
>   hl:m_topology_inuse=SccccccSCcCCCC
> binding:                    set linear:1
> binding    1:               SCCCCCCSCcCCCC
> binding:                    set linear:6:0,0
> binding    1:               SccccccSCCCCCC
> 
> compute-0-6             lx26-amd64     12  6.05   94.6G   15.2G 9.8G   48.3M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               NONE
> 
> compute-0-7             lx26-amd64     12  6.09   94.6G   32.0G 9.8G   96.7M
>   hl:m_topology_inuse=SccccccSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               SccccccSCCCCCC
> 
> compute-0-8             lx26-amd64     12  6.19   94.6G   31.3G 9.8G  101.1M
>   hl:m_topology_inuse=SccccccSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               SccccccSCCCCCC
> 
> compute-0-9             lx26-amd64     12  6.11   94.6G   12.4G 9.8G  115.9M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               NONE
> 
> compute-0-10            lx26-amd64     12  6.16   94.6G   15.4G 9.8G   85.7M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               NONE
> 
> compute-0-11            lx26-amd64     12  7.11   94.6G   13.3G 9.8G   60.3M
>   hl:m_topology_inuse=SCCCCCCScCCCCC
> binding:                    set linear:1
> binding    1:               SCCCCCCScCCCCC
> binding:                    set linear:6:0,0
> binding    1:               NONE
> 
> compute-1-0             lx26-amd64     12  2.57   94.6G   10.6G 9.8G   53.3M
>   hl:m_topology_inuse=SccCCCCSCCCCCC
> binding:                    set linear:1
> binding    1:               SCcCCCCSCCCCCC
> binding:                    set linear:1:0,0
> binding    1:               ScCCCCCSCCCCCC
> 
> compute-1-1             lx26-amd64     12  1.23   94.6G   10.2G 9.8G   92.8M
>   hl:m_topology_inuse=SCCCCCCScCCCCC
> binding:                    set linear:1
> binding    1:               SCCCCCCScCCCCC
> 
> compute-1-2             lx26-amd64     12  0.35   94.6G   10.3G 9.8G   40.7M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> 
> compute-1-3             lx26-amd64     12  1.70   94.6G   10.2G 9.8G   44.8M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> binding:                    set linear:1:0,0
> binding    1:               NONE
> binding:                    set linear:4:0,0
> binding    1:               NONE
> 
> compute-1-4             lx26-amd64     12  6.15   94.6G   14.2G 9.8G   58.7M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               NONE
> 
> compute-1-5             lx26-amd64     12  7.07   94.6G   33.9G 9.8G   46.1M
>   hl:m_topology_inuse=SccccccSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               SccccccSCCCCCC
> binding:                    set linear:1:0,0
> binding    1:               NONE
> 
> compute-1-6             lx26-amd64     12  7.13   94.6G   31.5G 9.8G   78.1M
>   hl:m_topology_inuse=SccccccSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               SccccccSCCCCCC
> binding:                    set linear:1:0,0
> binding    1:               NONE
> 
> compute-1-7             lx26-amd64     12  6.14   94.6G   16.3G 9.8G   40.2M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               NONE
> 
> compute-1-8             lx26-amd64     12  7.10   94.6G   32.2G 9.8G   39.4M
>   hl:m_topology_inuse=SccccccSCCCCCC
> binding:                    set linear:6:0,0
> binding    1:               SccccccSCCCCCC
> binding:                    set linear:1:0,0
> binding    1:               NONE
> 
> compute-1-9             lx26-amd64     12  0.45   94.6G   10.1G 9.8G   39.0M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> 
> compute-1-10            lx26-amd64     12  0.40   94.6G   10.5G 9.8G  191.2M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> 
> compute-1-11            lx26-amd64     12  0.02   94.6G   10.6G 9.8G   34.4M
>   hl:m_topology_inuse=SCCCCCCSCCCCCC
> 
> 
> 
> 
> As you can see, some nodes report every binding correctly, some report only 
> part of them, and some report none. The jobs are not consistent either: all 
> the 6-core jobs are identical and were submitted at once by the same user, 
> yet some are bound and some are not. I have even seen three 4-core jobs 
> running on the same node, where one showed the proper binding and the other 
> two showed NONE (I couldn't capture them).
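
One way to cross-check the listing above is to merge the per-job "binding 1:" 
masks of a host and compare the result with that host's hl:m_topology_inuse. 
Below is a minimal bash sketch. merge_bindings is a hypothetical helper name, 
not an SGE tool, and it assumes that a lowercase letter marks a bound core 
and that NONE means the job runs unbound.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SGE): merge per-job binding masks.
# $1   = free topology of the host, e.g. SCCCCCCSCCCCCC
# $2.. = "binding 1:" values of the jobs running on that host
merge_bindings() {
    local base=$1; shift
    local merged=$base mask out ch i
    for mask in "$@"; do
        # Unbound jobs contribute nothing to the mask.
        [ "$mask" = "NONE" ] && continue
        if [ "${#mask}" -ne "${#base}" ]; then
            echo "length mismatch: $mask" >&2
            return 1
        fi
        out=""
        for ((i = 0; i < ${#base}; i++)); do
            ch=${mask:i:1}
            # Keep a lowercase (bound) letter, otherwise keep what we had.
            if [[ $ch == [[:lower:]] ]]; then
                out+=$ch
            else
                out+=${merged:i:1}
            fi
        done
        merged=$out
    done
    printf '%s\n' "$merged"
}

# Masks copied from the listing above.
merge_bindings SCCCCCCSCCCCCC SCCCCCCSCcCCCC SccccccSCCCCCC   # compute-0-5
merge_bindings SCCCCCCSCCCCCC ScCCCCCSCCCCCC NONE             # compute-0-1
```

For both hosts the merged mask equals the reported hl:m_topology_inuse 
(SccccccSCcCCCC and ScCCCCCSCCCCCC). That suggests the in-use topology only 
reflects jobs whose binding was actually applied, and that the NONE jobs are 
simply not counted.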
> 
> I have just realized that, for a few days, I have had both a JSV and a 
> wrapper script modifying the binding:
> 
> jsv:
> 
>        my $binding=1;
>        if (jsv_is_param('pe_name')) {
>                if (jsv_is_param('pe_max')) {
>                        $binding=jsv_get_param('pe_max');
>                } else{
>                        $binding=1;
>                }
>        } else {
>                $binding=1;
>        }
>        jsv_set_param('binding_type','set');
>        jsv_set_param('binding_strategy','linear');
>        jsv_set_param('binding_amount',$binding);
> 
> 
> wrapper: (users run this script instead of qsub; it modifies the command 
> line and then calls the real qsub. All of this happens before the JSV runs.)
> 
> if (!$pe) {
>        $binding = " -binding linear:1 ";
> } else {
>        $binding = " -binding linear:$pe_slots_num ";
> }
> ...
> qsub_orig ... $binding $original_params
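
Both scripts derive the same binding amount from the PE slot count, so that 
rule could live in one place. Here is a minimal bash sketch of the rule, 
assuming the slot count arrives as a plain integer. binding_arg is a 
hypothetical name and is not part of either script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the -binding argument from the PE slot count.
# $1 = number of PE slots; empty or non-numeric means "no usable PE request".
binding_arg() {
    local slots=${1:-}
    # Fall back to a single core for serial jobs or unparsable values.
    if [[ ! $slots =~ ^[1-9][0-9]*$ ]]; then
        slots=1
    fi
    printf -- '-binding linear:%s\n' "$slots"
}

binding_arg        # serial job
binding_arg 6      # job requesting 6 PE slots
```

The two calls print "-binding linear:1" and "-binding linear:6". A slot range 
such as 4-8 falls back to one core in this sketch, which may not be what you 
want for jobs submitted with a PE range.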
> 
> 
> However, all the currently running 6-core jobs, for instance, were submitted 
> while both scripts were in use, and yet each job behaves differently.
> 
> What is happening here?
> 
> 
> PS: I have just disabled the wrapper binding thing. I'll check if now 
> everything works fine.
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
