Here is the explain output:
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: logontable
filterExpr: isipv4(ip) (type: boolean)
Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL
Column stats: NONE
Filter Operator
predicate: isipv4(ip) (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE
Column stats: NONE
Select Operator
expressions: ip (type: bigint)
outputColumnNames: _col0
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE
Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE
Column stats: NONE
value expressions: _col0 (type: bigint)
Map 3
Map Operator Tree:
TableScan
alias: ipv4geotable
Statistics: Num rows: 41641243 Data size: 5144651200 Basic
stats: COMPLETE Column stats: NONE
Select Operator
expressions: startip (type: bigint), endip (type: bigint),
country (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 41641243 Data size: 5144651200 Basic
stats: COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 41641243 Data size: 5144651200
Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint), _col1 (type:
bigint), _col2 (type: string)
Reducer 2
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
filter predicates:
0 {isipv4(VALUE._col0)}
1
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 43281312 Data size: 5020632576 Basic
stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type:
boolean)
Statistics: Num rows: 5209034 Data size: 497855986 Basic
stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col3 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 5209034 Data size: 497855986 Basic
stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 5209034 Data size: 497855986 Basic
stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Thank you,
B
> Subject: Re: Hive indexing optimization
> From: jpullokka...@hortonworks.com
> To: user@hive.apache.org
> CC: tben...@hotmail.com
> Date: Sat, 27 Jun 2015 16:02:08 +0000
>
> "SELECT StartIp, EndIp, Country FROM ipv4geotable” should have been
> rewritten as a scan against index table.
>
> BitMap Indexes seems to support inequalities (<=, <, >=).
>
> Post the explain plan.
>
> On 6/26/15, 8:56 PM, "Gopal Vijayaraghavan" <gop...@apache.org> wrote:
>
> >Hi,
> >
> >Hive indexes won¹t really help you speed up that query right now, because
> >of the plan it generates due to the <= clauses.
> >
> >> CREATETABLE ipv4table
> >> AS
> >> SELECT logon.IP, ipv4.Country
> >> FROM
> >> (SELECT * FROM logontable WHERE isIpv4(IP)) logon
> >> LEFT OUTER JOIN
> >> (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON
> >> isIpv4(logon.IP)
> >> WHERE ipv4.StartIp <=logon.IP AND logon.IP <= ipv4.EndIp;
> >
> >That¹s a cross-product join, which can¹t be distributed at all & will take
> >forever, even if you use Tez/LLAP.
> >
> >Range join queries have partial distribution rewrites, but AFAIK none of
> >them use filter indexes in hive.
> >
> >But before I suggest a rewrite, can you post the ³explain <query>;² for
> >the above query, so that I can check if this is indeed producing a
> >cross-product + filter?
> >
> >Cheers,
> >Gopal
> >
> >
> >
> >
> >
> >
> >
> >
>