RE: Hive indexing optimization

Bennie Leo Mon, 29 Jun 2015 10:36:58 -0700
Here is the explain output:
 
STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: logontable
                  filterExpr: isipv4(ip) (type: boolean)
                  Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL 
Column stats: NONE
                  Filter Operator
                    predicate: isipv4(ip) (type: boolean)
                    Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
                    Select Operator
                      expressions: ip (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
                      Reduce Output Operator
                        sort order: 
                        Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
                        value expressions: _col0 (type: bigint)
        Map 3 
            Map Operator Tree:
                TableScan
                  alias: ipv4geotable
                  Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: startip (type: bigint), endip (type: bigint), 
country (type: string)
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      sort order: 
                      Statistics: Num rows: 41641243 Data size: 5144651200 
Basic stats: COMPLETE Column stats: NONE
                      value expressions: _col0 (type: bigint), _col1 (type: 
bigint), _col2 (type: string)
        Reducer 2 
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Left Outer Join0 to 1
                condition expressions:
                  0 {VALUE._col0} 
                  1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
                filter predicates:
                  0 {isipv4(VALUE._col0)}
                  1 
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 43281312 Data size: 5020632576 Basic 
stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: 
boolean)
                  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: bigint), _col3 (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
 
Thank you,
B
 
> Subject: Re: Hive indexing optimization
> From: jpullokka...@hortonworks.com
> To: user@hive.apache.org
> CC: tben...@hotmail.com
> Date: Sat, 27 Jun 2015 16:02:08 +0000
> 
> "SELECT StartIp, EndIp, Country FROM ipv4geotable” should have been
> rewritten as a scan against index table.
> 
> BitMap Indexes seems to support inequalities (<=, <, >=).
> 
> Post the explain plan.
> 
> On 6/26/15, 8:56 PM, "Gopal Vijayaraghavan" <gop...@apache.org> wrote:
> 
> >Hi,
> >
> >Hive indexes won¹t really help you speed up that query right now, because
> >of the plan it generates due to the <= clauses.
> >
> >> CREATETABLE ipv4table
> >> AS
> >> SELECT logon.IP, ipv4.Country
> >> FROM
> >> (SELECT * FROM logontable WHERE isIpv4(IP)) logon
> >> LEFT OUTER JOIN
> >> (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON
> >> isIpv4(logon.IP)
> >> WHERE ipv4.StartIp <=logon.IP AND logon.IP <= ipv4.EndIp;
> >
> >That¹s a cross-product join, which can¹t be distributed at all & will take
> >forever, even if you use Tez/LLAP.
> >
> >Range join queries have partial distribution rewrites, but AFAIK none of
> >them use filter indexes in hive.
> >
> >But before I suggest a rewrite, can you post the ³explain <query>;² for
> >the above query, so that I can check if this is indeed producing a
> >cross-product + filter?
> >
> >Cheers,
> >Gopal
> >
> >
> >
> >
> >
> >
> >
> >
>
RE: Hive indexing optimization

Reply via email to