On 27/06/14 11:31, Reuti wrote:
Hi,

On 26.06.2014 at 17:56, Txema Heredia wrote:

<snip>

# qstat -j 4561291 -cb | grep "job_name\|binding\|queue_list"
job_name:                   c0-1
hard_queue_list:            *@compute-0-1.local
binding:                    set linear:1:0,0
binding    1:               NONE

What am I missing here? What can be different in my nodes?
Does `qhost -F` output the fields:

$ qhost -F
...
    hl:m_topology=SC
    hl:m_topology_inuse=SC
    hl:m_socket=1.000000
    hl:m_core=1.000000

for this machine?

-- Reuti
Yes, qhost -F reports that for all nodes:

# qhost -F | grep "compute\|hl:m_"
compute-0-0             lx26-amd64     12  0.60   94.6G   10.1G 9.8G   53.8M
   hl:m_topology=SCCCCCCSCCCCCC
   hl:m_topology_inuse=SCCCCCCSCCCCCC
   hl:m_socket=2.000000
   hl:m_core=12.000000
compute-0-1             lx26-amd64     12  7.21   94.6G   14.9G 9.8G   86.6M
   hl:m_topology=SCCCCCCSCCCCCC
   hl:m_topology_inuse=ScCCCCCSCCCCCC
   hl:m_socket=2.000000
   hl:m_core=12.000000
...


But the m_topology_inuse value is blatantly wrong. Below is the combined output of "qhost -F" and "qstat -cb -j" for all jobs on all nodes (incoming wall of text):


# for i in $(seq 0 1); do \
    for j in $(seq 0 11); do \
      comp="compute-${i}-${j}"; \
      qhost -F -h ${comp} | grep "${comp}\|inuse"; \
      for id in $(qstat -u '*' -s r -q all.q@${comp} | grep ${comp} | awk '{print $1}'); do \
        qstat -cb -j ${id} | grep "binding"; \
      done; \
    done; \
  done

compute-0-0             lx26-amd64     12  0.40   94.6G   10.1G 9.8G   53.8M
   hl:m_topology_inuse=SCCCCCCSCCCCCC

compute-0-1             lx26-amd64     12  7.12   94.6G   14.0G 9.8G   86.6M
   hl:m_topology_inuse=ScCCCCCSCCCCCC
binding:                    set linear:1
binding    1:               ScCCCCCSCCCCCC
binding:                    set linear:6:0,0
binding    1:               NONE

compute-0-2             lx26-amd64     12  7.21   94.6G   18.8G 9.8G   49.4M
   hl:m_topology_inuse=SCcCCCCSCCCCCC
binding:                    set linear:1
binding    1:               SCcCCCCSCCCCCC
binding:                    set linear:6:0,0
binding    1:               NONE

compute-0-3             lx26-amd64     12  7.08   94.6G   13.6G 9.8G  128.5M
   hl:m_topology_inuse=ScCCCCCSCCCCCC
binding:                    set linear:6:0,0
binding    1:               NONE
binding:                    set linear:1:0,0
binding    1:               ScCCCCCSCCCCCC

compute-0-4             lx26-amd64     12  6.06   94.6G   12.4G 9.8G   79.5M
   hl:m_topology_inuse=SCCCCCCSCCCCCC
binding:                    set linear:6:0,0
binding    1:               NONE

compute-0-5             lx26-amd64     12  7.11   94.6G   31.6G 9.8G   92.4M
   hl:m_topology_inuse=SccccccSCcCCCC
binding:                    set linear:1
binding    1:               SCCCCCCSCcCCCC
binding:                    set linear:6:0,0
binding    1:               SccccccSCCCCCC

compute-0-6             lx26-amd64     12  6.05   94.6G   15.2G 9.8G   48.3M
   hl:m_topology_inuse=SCCCCCCSCCCCCC
binding:                    set linear:6:0,0
binding    1:               NONE

compute-0-7             lx26-amd64     12  6.09   94.6G   32.0G 9.8G   96.7M
   hl:m_topology_inuse=SccccccSCCCCCC
binding:                    set linear:6:0,0
binding    1:               SccccccSCCCCCC

compute-0-8             lx26-amd64     12  6.19   94.6G   31.3G 9.8G  101.1M
   hl:m_topology_inuse=SccccccSCCCCCC
binding:                    set linear:6:0,0
binding    1:               SccccccSCCCCCC

compute-0-9             lx26-amd64     12  6.11   94.6G   12.4G 9.8G  115.9M
   hl:m_topology_inuse=SCCCCCCSCCCCCC
binding:                    set linear:6:0,0
binding    1:               NONE

compute-0-10            lx26-amd64     12  6.16   94.6G   15.4G 9.8G   85.7M
   hl:m_topology_inuse=SCCCCCCSCCCCCC
binding:                    set linear:6:0,0
binding    1:               NONE

compute-0-11            lx26-amd64     12  7.11   94.6G   13.3G 9.8G   60.3M
   hl:m_topology_inuse=SCCCCCCScCCCCC
binding:                    set linear:1
binding    1:               SCCCCCCScCCCCC
binding:                    set linear:6:0,0
binding    1:               NONE

compute-1-0             lx26-amd64     12  2.57   94.6G   10.6G 9.8G   53.3M
   hl:m_topology_inuse=SccCCCCSCCCCCC
binding:                    set linear:1
binding    1:               SCcCCCCSCCCCCC
binding:                    set linear:1:0,0
binding    1:               ScCCCCCSCCCCCC

compute-1-1             lx26-amd64     12  1.23   94.6G   10.2G 9.8G   92.8M
   hl:m_topology_inuse=SCCCCCCScCCCCC
binding:                    set linear:1
binding    1:               SCCCCCCScCCCCC

compute-1-2             lx26-amd64     12  0.35   94.6G   10.3G 9.8G   40.7M
   hl:m_topology_inuse=SCCCCCCSCCCCCC

compute-1-3             lx26-amd64     12  1.70   94.6G   10.2G 9.8G   44.8M
   hl:m_topology_inuse=SCCCCCCSCCCCCC
binding:                    set linear:1:0,0
binding    1:               NONE
binding:                    set linear:4:0,0
binding    1:               NONE

compute-1-4             lx26-amd64     12  6.15   94.6G   14.2G 9.8G   58.7M
   hl:m_topology_inuse=SCCCCCCSCCCCCC
binding:                    set linear:6:0,0
binding    1:               NONE

compute-1-5             lx26-amd64     12  7.07   94.6G   33.9G 9.8G   46.1M
   hl:m_topology_inuse=SccccccSCCCCCC
binding:                    set linear:6:0,0
binding    1:               SccccccSCCCCCC
binding:                    set linear:1:0,0
binding    1:               NONE

compute-1-6             lx26-amd64     12  7.13   94.6G   31.5G 9.8G   78.1M
   hl:m_topology_inuse=SccccccSCCCCCC
binding:                    set linear:6:0,0
binding    1:               SccccccSCCCCCC
binding:                    set linear:1:0,0
binding    1:               NONE

compute-1-7             lx26-amd64     12  6.14   94.6G   16.3G 9.8G   40.2M
   hl:m_topology_inuse=SCCCCCCSCCCCCC
binding:                    set linear:6:0,0
binding    1:               NONE

compute-1-8             lx26-amd64     12  7.10   94.6G   32.2G 9.8G   39.4M
   hl:m_topology_inuse=SccccccSCCCCCC
binding:                    set linear:6:0,0
binding    1:               SccccccSCCCCCC
binding:                    set linear:1:0,0
binding    1:               NONE

compute-1-9             lx26-amd64     12  0.45   94.6G   10.1G 9.8G   39.0M
   hl:m_topology_inuse=SCCCCCCSCCCCCC

compute-1-10            lx26-amd64     12  0.40   94.6G   10.5G 9.8G  191.2M
   hl:m_topology_inuse=SCCCCCCSCCCCCC

compute-1-11            lx26-amd64     12  0.02   94.6G   10.6G 9.8G   34.4M
   hl:m_topology_inuse=SCCCCCCSCCCCCC




As you can see, some nodes report everything properly, some report only part of it, and some report nothing. There is also no coherence among jobs: all the 6-core jobs are identical, submitted at once by the same user, yet some make use of the binding and some don't. I have even witnessed three 4-core jobs running on the same node where one showed the proper binding and the other two had NONE (I couldn't capture them).
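In these topology strings an uppercase C is a free core and a lowercase c is one marked in use, so the mismatch between hosts can at least be quantified. A minimal sketch (the count_used_cores helper is mine, not part of SGE):

```shell
# Count cores marked in use (lowercase 'c') in an m_topology_inuse string.
count_used_cores() {
    printf '%s' "$1" | tr -cd 'c' | wc -c
}

count_used_cores "SccccccSCcCCCC"   # compute-0-5: 7 cores marked in use
count_used_cores "SCCCCCCSCCCCCC"   # fully idle topology: 0
```

Comparing these counts against the slots that "qstat -cb -j" claims are bound makes the incoherence obvious.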

I have just realized that, for the past few days, I have had both a JSV and a wrapper script modifying the binding:

jsv:

        # Default to binding one core; if the job requests a PE with an
        # upper slot limit, bind that many cores instead.
        my $binding = 1;
        if (jsv_is_param('pe_name') && jsv_is_param('pe_max')) {
                $binding = jsv_get_param('pe_max');
        }
        jsv_set_param('binding_type', 'set');
        jsv_set_param('binding_strategy', 'linear');
        jsv_set_param('binding_amount', $binding);


wrapper (users run this script instead of qsub; it modifies the command line and then calls qsub itself, so everything here happens before the JSV runs):

if (!$pe) {
        $binding = " -binding linear:1 ";
} else {
        $binding = " -binding linear:$pe_slots_num ";
}
...
qsub_orig ... $binding $original_params


But, for instance, all the currently running 6-core jobs were submitted while both scripts were in place, yet each job behaves differently.

What is happening here?


PS: I have just disabled the binding logic in the wrapper. I'll check whether everything works fine now.
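When re-checking, the kernel's own record of each process's CPU binding can be compared against what qstat reports. A sketch assuming Linux (the Cpus_allowed_list field in /proc is generic; sge_shepherd is the usual SGE per-job process name):

```shell
# The kernel's authoritative record of a process's CPU binding lives in
# /proc/<pid>/status. For a bound job it should list exactly the cores
# that "qstat -cb -j" shows, regardless of what m_topology_inuse claims.
for pid in $(pgrep sge_shepherd); do
    echo "shepherd $pid: $(grep Cpus_allowed_list /proc/$pid/status)"
done

# The same field for the current shell, as a quick sanity check:
grep Cpus_allowed_list /proc/self/status
```

If the shepherds show a restricted core list while qstat prints NONE (or vice versa), that narrows the problem down to reporting rather than the binding itself.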


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
