from:"Bennie Leo"

RE: Optimizing UDF

2015-07-14 Thread Bennie Leo

Thanks for your reply.
 
I am already using Tez (sorry, forgot to mention this), and my goal is indeed 
to build the instance once per container.
 
I'm sorry I don't understand what the solution would be with Tez. Are you 
saying that the object should be a private final? The only element I would be 
missing in this case is the final keyword. I fail to see how this will make a 
difference...
 
Thanks,
B

> Date: Tue, 14 Jul 2015 15:19:16 -0700
> Subject: Re: Optimizing UDF
> From: gop...@apache.org
> To: user@hive.apache.org
> CC: tben...@hotmail.com
> 
> 
>  
> > I'm trying to optimize a UDF that runs very slowly on Hive. The UDF
> >takes in a 5GB table and builds a large data structure out of it to
> >facilitate lookups. The 5GB input is loaded into the distributed cache
> >with an 'add file ' command, and the UDF builds
> > the data structure a single time per instance (or so it should).
> 
> No, this builds it once per map attempt in MRv2, because each JVM is
> killed after executing a single map attempt.
> 
> In Tez, however you can build this once per container (usually, a ~10x
> perf improvement).
> 
> This has a fix in Tez, since the UDFs can only load it over the network
> once per JVM init and you can hang onto that in the loaded GenericUDF
> object (*not* a static, but a private final), which is held in the
> TezCache as long as the task keeps running the same vertex.
> 
> That will be thrown away whenever the container switches over to running a
> reducer, so the cache is transient.
> 
> Cheers,
> Gopal
> 
>

Optimizing UDF

2015-07-14 Thread Bennie Leo

Hi,
 
I'm trying to optimize a UDF that runs very slowly on Hive. The UDF takes in a 
5GB table and builds a large data structure out of it to facilitate lookups. 
The 5GB input is loaded into the distributed cache with an 'add file ' 
command, and the UDF builds the data structure a single time per instance (or 
so it should). 
 
My problem is that the Hive UDF takes several hours to complete, while running 
the exact same code on my local machine takes 5 minutes! What could be causing 
Hive to be so impractically slow? According to the Hive logs, the data transfer 
takes 5-10 minutes, which is reasonable. What else is taking so long?
 
Thanks,
B

RE: Limiting outer join

2015-07-07 Thread Bennie Leo

It went from about 60 mins to 3 mins. Hive was traversing the whole table 
multiple times, which is obviously inefficient!
 
> Date: Tue, 7 Jul 2015 15:55:19 -0700
> Subject: Re: Limiting outer join
> From: gop...@apache.org
> To: user@hive.apache.org
> 
> 
> > Never mind, I got it working with UDF. I just pass the file location to
> >my evaluate function. Thanks! :)
> 
> Nice. Would be very interested in looking at performance of such a UDF, if
> you have numbers before/after.
> 
> I suspect it will be a magnitude or more faster than the BETWEEN/JOIN
> clauses.
> 
> Cheers,
> Gopal
> 
>

RE: Limiting outer join

2015-07-07 Thread Bennie Leo

Never mind, I got it working with UDF. I just pass the file location to my 
evaluate function. Thanks! :)

From: tben...@hotmail.com
To: user@hive.apache.org
Subject: RE: Limiting outer join
Date: Tue, 7 Jul 2015 09:59:22 -0700

Thanks for your replies.

I see how extracting the first country would work, however I was hoping to 
speed up my query by stopping the search once a country has been found.

Are you suggesting that I pass the whole IP table to a UDF and perform the 
search myself? I've only programmed simple UDFs so far (ex: reformat a string), 
so any additional details would be appreciated. I am mostly concerned about 
importing said table (currently stored in Hive) and distributing the task 
across nodes (note that I use Tez).

Regards,
B

> Date: Mon, 6 Jul 2015 18:18:44 -0700
> Subject: Re: Limiting outer join
> From: gop...@apache.org
> To: user@hive.apache.org
> 
> 
> > In the following query, it is possible to limit the amount of entries
> >returned by an outer join to a single value? I want to obtain a single
> >country from ipv4geotable for each entry in logontable.
> 
> Yes, the PTF DENSE_RANK()/ROW_NUMBER() basically gives you that - you can
> read the first row out of each logon.IP except, there¹s no way to force
> which country wins over the other without an order by country in the
> OVER() clause as well.
> 
> That said, it will only get slower to produce 1 row per group, because of
> the distributed nature of the SQL engine, the reduction of data happens
> after a ordering shuffle.
> 
> You¹re doing range joins in a SQL engine without theta joins and MapReduce
> had no way to implement those at runtime (Tez has, with EdgeManager
> plugins).
> 
> The easiest/traditional approach out of doing geo-IP lookups is a compact
> UDF model without any joins at all.
> 
> There¹s some old threads on discussing this as a built-in & some code
> (with potential licensing issues) -
> http://markmail.org/message/w54j4upwg2wbh3xg
> 
> Cheers,
> Gopal
> 
>

RE: Limiting outer join

2015-07-07 Thread Bennie Leo

Thanks for your replies.
 
I see how extracting the first country would work, however I was hoping to 
speed up my query by stopping the search once a country has been found.
 
Are you suggesting that I pass the whole IP table to a UDF and perform the 
search myself? I've only programmed simple UDFs so far (ex: reformat a string), 
so any additional details would be appreciated. I am mostly concerned about 
importing said table (currently stored in Hive) and distributing the task 
across nodes (note that I use Tez).
 
Regards,
B
 
> Date: Mon, 6 Jul 2015 18:18:44 -0700
> Subject: Re: Limiting outer join
> From: gop...@apache.org
> To: user@hive.apache.org
> 
> 
> > In the following query, it is possible to limit the amount of entries
> >returned by an outer join to a single value? I want to obtain a single
> >country from ipv4geotable for each entry in logontable.
> 
> Yes, the PTF DENSE_RANK()/ROW_NUMBER() basically gives you that - you can
> read the first row out of each logon.IP except, there¹s no way to force
> which country wins over the other without an order by country in the
> OVER() clause as well.
> 
> That said, it will only get slower to produce 1 row per group, because of
> the distributed nature of the SQL engine, the reduction of data happens
> after a ordering shuffle.
> 
> You¹re doing range joins in a SQL engine without theta joins and MapReduce
> had no way to implement those at runtime (Tez has, with EdgeManager
> plugins).
> 
> The easiest/traditional approach out of doing geo-IP lookups is a compact
> UDF model without any joins at all.
> 
> There¹s some old threads on discussing this as a built-in & some code
> (with potential licensing issues) -
> http://markmail.org/message/w54j4upwg2wbh3xg
> 
> Cheers,
> Gopal
> 
>

Limiting outer join

2015-07-06 Thread Bennie Leo

 Hi,
 
In the following query, it is possible to limit the amount of entries returned 
by an outer join to a single value? I want to obtain a single country from 
ipv4geotable for each entry in logontable. 

CREATE TABLE ipv4table AS
SELECT logon.IP, ipv4.Country
FROM 
(SELECT * FROM logontable WHERE isIpv4(IP)) logon
LEFT OUTER JOIN
(SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON isIpv4(logon.IP) 
WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp;
 
For instance, if I had the IP "W.X.Y.Z" in logontable, and that "W.X.Y.Z" fell 
in the range of both Italy and Spain in ipv4geotable, then I would like to 
associate it with Italy only. 
I've tried adding "LIMIT 1" to the second subquery :(SELECT StartIp, EndIp, 
Country FROM ipv4geotable LIMIT 1) ipv4 ON isIpv4(logon.IP)but this is wrong 
since the WHERE clause has to traverse all IPs. Limiting the where clause 
doesn't help either. 
Any ideas?
 
Thank you!
B

RE: Hive indexing optimization

2015-06-30 Thread Bennie Leo

Thank you, I will do that.
B
 
Subject: Re: Hive indexing optimization
From: jpullokka...@hortonworks.com
To: user@hive.apache.org
Date: Tue, 30 Jun 2015 18:46:50 +






Index doesn’t seems to be kicking in this case.
Please file a bug for this.



Thanks
John





From: Bennie Leo 

Reply-To: "user@hive.apache.org" 

Date: Monday, June 29, 2015 at 5:25 PM

To: "user@hive.apache.org" 

Subject: RE: Hive indexing optimization








I've attached the output. Thanks.

  
B



 


Subject: Re: Hive indexing optimization

From: jpullokka...@hortonworks.com

To: user@hive.apache.org

Date: Mon, 29 Jun 2015 19:17:44 +



Could you post explain extended output?





From: Bennie Leo 

Reply-To: "user@hive.apache.org" 

Date: Monday, June 29, 2015 at 10:35 AM

To: "user@hive.apache.org" 

Subject: RE: Hive indexing optimization







Here is the explain output:

 

STAGE PLANS:

  Stage: Stage-1

Tez

  Edges:

Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)

  Vertices:

Map 1 

Map Operator Tree:

TableScan

  alias: logontable

  filterExpr: isipv4(ip) (type: boolean)

  Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL 
Column stats: NONE

  Filter Operator

predicate: isipv4(ip) (type: boolean)

Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE

Select Operator

  expressions: ip (type: bigint)

  outputColumnNames: _col0

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE

  Reduce Output Operator

sort order: 

Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE

value expressions: _col0 (type: bigint)

Map 3 

Map Operator Tree:

TableScan

  alias: ipv4geotable

  Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE

  Select Operator

expressions: startip (type: bigint), endip (type: bigint), 
country (type: string)

outputColumnNames: _col0, _col1, _col2

Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE

Reduce Output Operator

  sort order: 

  Statistics: Num rows: 41641243 Data size: 5144651200 
Basic stats: COMPLETE Column stats: NONE

  value expressions: _col0 (type: bigint), _col1 (type: 
bigint), _col2 (type: string)

Reducer 2 

Reduce Operator Tree:

  Join Operator

condition map:

 Left Outer Join0 to 1

condition expressions:

  0 {VALUE._col0} 

  1 {VALUE._col0} {VALUE._col1} {VALUE._col2}

filter predicates:

  0 {isipv4(VALUE._col0)}

  1 

outputColumnNames: _col0, _col1, _col2, _col3

Statistics: Num rows: 43281312 Data size: 5020632576 Basic 
stats: COMPLETE Column stats: NONE

Filter Operator

  predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: 
boolean)

  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE

  Select Operator

expressions: _col0 (type: bigint), _col3 (type: string)

outputColumnNames: _col0, _col1

Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE

File Output Operator

  compressed: false

  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE

  table:

  input format: org.apache.hadoop.mapred.TextInputFormat

  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0

Fetch Operator

  limit: -1


 

Thank you,

B

 

> Subject: Re: Hive indexing optimization

> From: jpullokka...@hortonworks.com

> To: user@hive.apache.org

> CC: tben...@hotmail.com

> Date: Sat, 27 Jun 2015 16:02:08 +

> 

> "SELECT StartIp, EndIp, Country FROM ipv4geotable” should have been

> rewritten as a scan against index table.

> 

> BitMap Indexes seems to support inequalities (<=, <, >=).

> 

> Post the explain plan.

> 

&g

RE: Hive indexing optimization

2015-06-29 Thread Bennie Leo

I've attached the output. Thanks.
  B

 Subject: Re: Hive indexing optimization
From: jpullokka...@hortonworks.com
To: user@hive.apache.org
Date: Mon, 29 Jun 2015 19:17:44 +






Could you post explain extended output?





From: Bennie Leo 

Reply-To: "user@hive.apache.org" 

Date: Monday, June 29, 2015 at 10:35 AM

To: "user@hive.apache.org" 

Subject: RE: Hive indexing optimization







Here is the explain output:

 

STAGE PLANS:

  Stage: Stage-1

Tez

  Edges:

Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)

  Vertices:

Map 1 

Map Operator Tree:

TableScan

  alias: logontable

  filterExpr: isipv4(ip) (type: boolean)

  Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL 
Column stats: NONE

  Filter Operator

predicate: isipv4(ip) (type: boolean)

Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE

Select Operator

  expressions: ip (type: bigint)

  outputColumnNames: _col0

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE

  Reduce Output Operator

sort order: 

Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE

value expressions: _col0 (type: bigint)

Map 3 

Map Operator Tree:

TableScan

  alias: ipv4geotable

  Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE

  Select Operator

expressions: startip (type: bigint), endip (type: bigint), 
country (type: string)

outputColumnNames: _col0, _col1, _col2

Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE

Reduce Output Operator

  sort order: 

  Statistics: Num rows: 41641243 Data size: 5144651200 
Basic stats: COMPLETE Column stats: NONE

  value expressions: _col0 (type: bigint), _col1 (type: 
bigint), _col2 (type: string)

Reducer 2 

Reduce Operator Tree:

  Join Operator

condition map:

 Left Outer Join0 to 1

condition expressions:

  0 {VALUE._col0} 

  1 {VALUE._col0} {VALUE._col1} {VALUE._col2}

filter predicates:

  0 {isipv4(VALUE._col0)}

  1 

outputColumnNames: _col0, _col1, _col2, _col3

Statistics: Num rows: 43281312 Data size: 5020632576 Basic 
stats: COMPLETE Column stats: NONE

Filter Operator

  predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: 
boolean)

  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE

  Select Operator

expressions: _col0 (type: bigint), _col3 (type: string)

outputColumnNames: _col0, _col1

Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE

File Output Operator

  compressed: false

  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE

  table:

  input format: org.apache.hadoop.mapred.TextInputFormat

  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0

Fetch Operator

  limit: -1


 

Thank you,

B

 

> Subject: Re: Hive indexing optimization

> From: jpullokka...@hortonworks.com

> To: user@hive.apache.org

> CC: tben...@hotmail.com

> Date: Sat, 27 Jun 2015 16:02:08 +

> 

> "SELECT StartIp, EndIp, Country FROM ipv4geotable” should have been

> rewritten as a scan against index table.

> 

> BitMap Indexes seems to support inequalities (<=, <, >=).

> 

> Post the explain plan.

> 

> On 6/26/15, 8:56 PM, "Gopal Vijayaraghavan"  wrote:

> 

> >Hi,

> >

> >Hive indexes won¹t really help you speed up that query right now, because

> >of the plan it generates due to the <= clauses.

> >

> >> CREATETABLE ipv4table

> >> AS

> >> SELECT logon.IP, ipv4.Country

> >> FROM

> >> (SELECT * FROM logontable WHERE isIpv4(IP)) logon

> >> LEFT OUTER JOI

RE: Hive indexing optimization

2015-06-29 Thread Bennie Leo

Here is the explain output:
 
STAGE PLANS:
  Stage: Stage-1
Tez
  Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
  Vertices:
Map 1 
Map Operator Tree:
TableScan
  alias: logontable
  filterExpr: isipv4(ip) (type: boolean)
  Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL 
Column stats: NONE
  Filter Operator
predicate: isipv4(ip) (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
Select Operator
  expressions: ip (type: bigint)
  outputColumnNames: _col0
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
  Reduce Output Operator
sort order: 
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE 
Column stats: NONE
value expressions: _col0 (type: bigint)
Map 3 
Map Operator Tree:
TableScan
  alias: ipv4geotable
  Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: startip (type: bigint), endip (type: bigint), 
country (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 41641243 Data size: 5144651200 Basic 
stats: COMPLETE Column stats: NONE
Reduce Output Operator
  sort order: 
  Statistics: Num rows: 41641243 Data size: 5144651200 
Basic stats: COMPLETE Column stats: NONE
  value expressions: _col0 (type: bigint), _col1 (type: 
bigint), _col2 (type: string)
Reducer 2 
Reduce Operator Tree:
  Join Operator
condition map:
 Left Outer Join0 to 1
condition expressions:
  0 {VALUE._col0} 
  1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
filter predicates:
  0 {isipv4(VALUE._col0)}
  1 
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 43281312 Data size: 5020632576 Basic 
stats: COMPLETE Column stats: NONE
Filter Operator
  predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: 
boolean)
  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
  Select Operator
expressions: _col0 (type: bigint), _col3 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 5209034 Data size: 497855986 Basic 
stats: COMPLETE Column stats: NONE
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
Fetch Operator
  limit: -1
 
Thank you,
B
 
> Subject: Re: Hive indexing optimization
> From: jpullokka...@hortonworks.com
> To: user@hive.apache.org
> CC: tben...@hotmail.com
> Date: Sat, 27 Jun 2015 16:02:08 +
> 
> "SELECT StartIp, EndIp, Country FROM ipv4geotable” should have been
> rewritten as a scan against index table.
> 
> BitMap Indexes seems to support inequalities (<=, <, >=).
> 
> Post the explain plan.
> 
> On 6/26/15, 8:56 PM, "Gopal Vijayaraghavan"  wrote:
> 
> >Hi,
> >
> >Hive indexes won¹t really help you speed up that query right now, because
> >of the plan it generates due to the <= clauses.
> >
> >> CREATETABLE ipv4table
> >> AS
> >> SELECT logon.IP, ipv4.Country
> >> FROM
> >> (SELECT * FROM logontable WHERE isIpv4(IP)) logon
> >> LEFT OUTER JOIN
> >> (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON
> >> isIpv4(logon.IP)
> >> WHERE ipv4.StartIp <=logon.IP AND logon.IP <= ipv4.EndIp;
> >
> >That¹s a cross-product join, which can¹t be distributed at all & will take
> >forever, even if you use Tez/LLAP.
> >
> >Range join queries have partial distribution rewrites, but AFAIK none of
> >them use filter indexes in hive.
> >
> >But before I suggest a rewrite, can you post the ³explain ;² for
> >the above query, so that I can check if this is indeed producing a
> >cross-product + filter?
> >
> >Cheers,
> >Gopal
> >
> >
> >
> >
> >
> >
> >
> >
>

RE: Hive indexing optimization

2015-06-26 Thread Bennie Leo

Thank you for your response. Unfortunately, my query still takes the same 
amount of time to complete with this new setting. 
 
Would I have to do something like:
 
"INSERT OVERWRITE DIRECTORY '/tmp/indexes/x' SELECT `_bucketname`, `_offsets` 
FROM default__t_x__ where j='and';
(The name default__t_x__ can be found in the output of step 2. Also, 
/tmp/indexes directory needs to exist in HDFS. You can substitute this to be 
any pre-existing directory in HDFS)
SET hive.index.compact.file=/tmp/indexes/x;
SET 
hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;"
 ?
 
I don't know how I could include this within my current query.
 
Cheers,
B
 
Subject: Re: Hive indexing optimization
From: jpullokka...@hortonworks.com
To: user@hive.apache.org
Date: Fri, 26 Jun 2015 01:27:21 +






Set hive.optimize.index.filter=true;



Thanks
John





From: Bennie Leo 

Reply-To: "user@hive.apache.org" 

Date: Thursday, June 25, 2015 at 5:48 PM

To: "user@hive.apache.org" 

Subject: Hive indexing optimization










Hi,

 

I am attempting to optimize a query using indexing. My current query converts 
an ipv4 address to a country using a geolocation table. However, the 
geolocation table is fairly large and the query takes an impractical amount of 
time. I have created indexes and
 set the binary search parameter to true (default), but the query is not 
faster. 

 

Here is how I set up indexing:



DROPINDEXIFEXISTS
 ipv4indexes ON ipv4geotable;

CREATEINDEX
 ipv4indexes 

ONTABLE
 ipv4geotable (StartIp, EndIp)

AS'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'

WITHDEFERREDREBUILD

IDXPROPERTIES ('hive.index.compact.binary.search'='true');


 


ALTERINDEX
 ipv4indexes ON ipv4geotable
REBUILD;


 

And here is my query:

 

DROPTABLEIFEXISTS
 ipv4table;

CREATETABLE
 ipv4table AS

SELECT logon.IP, ipv4.Country

FROM

(SELECT
 * FROM logontable
WHERE isIpv4(IP)) logon

LEFTOUTERJOIN

(SELECT
 StartIp, EndIp, Country FROM
 ipv4geotable) ipv4 ON
 isIpv4(logon.IP) 

WHERE ipv4.StartIp <=
logon.IPANDlogon.IP
 <= ipv4.EndIp;


 

What the query is doing is extracting an IP from logontable and finding in 
which range it lies within the geolocation table (which is sorted). When a 
range is found, the corresponding country is returned. The problem is that
 Hive goes through the whole table row by row rather than performing a smart 
search (ex: binary search).


 

Any suggestions on how to speed things up? 

 

Thank you,

B

Hive indexing optimization

2015-06-25 Thread Bennie Leo





Hi,
 
I am attempting to optimize a query using indexing. My current query converts 
an ipv4 address to a country using a geolocation table. However, the 
geolocation table is fairly large and the query takes an impractical amount of 
time. I have created indexes and set the binary search parameter to true 
(default), but the query is not faster. 
 
Here is how I set up indexing:


DROP INDEX IF EXISTS ipv4indexes ON ipv4geotable;
CREATE INDEX ipv4indexes 
ON TABLE ipv4geotable (StartIp, EndIp)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IDXPROPERTIES  ('hive.index.compact.binary.search'='true');
 
ALTER INDEX ipv4indexes ON ipv4geotable REBUILD;
 
And here is my query:
 
DROP TABLE IF EXISTS ipv4table;
CREATE TABLE ipv4table AS
SELECT logon.IP, ipv4.Country
FROM 
(SELECT * FROM logontable WHERE isIpv4(IP)) logon
LEFT OUTER JOIN
(SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON isIpv4(logon.IP) 
WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp;
 
What the query is doing is extracting an IP from logontable and finding in 
which range it lies within the geolocation table (which is sorted). When a 
range is found, the corresponding country is returned. The problem is that Hive 
goes through the whole table row by row rather than performing a smart search 
(ex: binary search). 
 
Any suggestions on how to speed things up? 
 
Thank you,
B

RE: Optimizing UDF

Optimizing UDF

RE: Limiting outer join

RE: Limiting outer join

RE: Limiting outer join

Limiting outer join

RE: Hive indexing optimization

RE: Hive indexing optimization

RE: Hive indexing optimization

RE: Hive indexing optimization

Hive indexing optimization

11 matches

Site Navigation

Mail list logo

Footer information