RE: Optimizing UDF
Thanks for your reply. I am already using Tez (sorry, forgot to mention this), and my goal is indeed to build the instance once per container. I'm sorry I don't understand what the solution would be with Tez. Are you saying that the object should be a private final? The only element I would be missing in this case is the final keyword. I fail to see how this will make a difference... Thanks, B > Date: Tue, 14 Jul 2015 15:19:16 -0700 > Subject: Re: Optimizing UDF > From: gop...@apache.org > To: user@hive.apache.org > CC: tben...@hotmail.com > > > > > I'm trying to optimize a UDF that runs very slowly on Hive. The UDF > >takes in a 5GB table and builds a large data structure out of it to > >facilitate lookups. The 5GB input is loaded into the distributed cache > >with an 'add file ' command, and the UDF builds > > the data structure a single time per instance (or so it should). > > No, this builds it once per map attempt in MRv2, because each JVM is > killed after executing a single map attempt. > > In Tez, however you can build this once per container (usually, a ~10x > perf improvement). > > This has a fix in Tez, since the UDFs can only load it over the network > once per JVM init and you can hang onto that in the loaded GenericUDF > object (*not* a static, but a private final), which is held in the > TezCache as long as the task keeps running the same vertex. > > That will be thrown away whenever the container switches over to running a > reducer, so the cache is transient. > > Cheers, > Gopal > >
Optimizing UDF
Hi, I'm trying to optimize a UDF that runs very slowly on Hive. The UDF takes in a 5GB table and builds a large data structure out of it to facilitate lookups. The 5GB input is loaded into the distributed cache with an 'add file ' command, and the UDF builds the data structure a single time per instance (or so it should). My problem is that the Hive UDF takes several hours to complete, while running the exact same code on my local machine takes 5 minutes! What could be causing Hive to be so impractically slow? According to the Hive logs, the data transfer takes 5-10 minutes, which is reasonable. What else is taking so long? Thanks, B
RE: Limiting outer join
It went from about 60 mins to 3 mins. Hive was traversing the whole table multiple times, which is obviously inefficient! > Date: Tue, 7 Jul 2015 15:55:19 -0700 > Subject: Re: Limiting outer join > From: gop...@apache.org > To: user@hive.apache.org > > > > Never mind, I got it working with UDF. I just pass the file location to > >my evaluate function. Thanks! :) > > Nice. Would be very interested in looking at performance of such a UDF, if > you have numbers before/after. > > I suspect it will be a magnitude or more faster than the BETWEEN/JOIN > clauses. > > Cheers, > Gopal > >
RE: Limiting outer join
Never mind, I got it working with UDF. I just pass the file location to my evaluate function. Thanks! :) From: tben...@hotmail.com To: user@hive.apache.org Subject: RE: Limiting outer join Date: Tue, 7 Jul 2015 09:59:22 -0700 Thanks for your replies. I see how extracting the first country would work, however I was hoping to speed up my query by stopping the search once a country has been found. Are you suggesting that I pass the whole IP table to a UDF and perform the search myself? I've only programmed simple UDFs so far (ex: reformat a string), so any additional details would be appreciated. I am mostly concerned about importing said table (currently stored in Hive) and distributing the task across nodes (note that I use Tez). Regards, B > Date: Mon, 6 Jul 2015 18:18:44 -0700 > Subject: Re: Limiting outer join > From: gop...@apache.org > To: user@hive.apache.org > > > > In the following query, it is possible to limit the amount of entries > >returned by an outer join to a single value? I want to obtain a single > >country from ipv4geotable for each entry in logontable. > > Yes, the PTF DENSE_RANK()/ROW_NUMBER() basically gives you that - you can > read the first row out of each logon.IP except, there¹s no way to force > which country wins over the other without an order by country in the > OVER() clause as well. > > That said, it will only get slower to produce 1 row per group, because of > the distributed nature of the SQL engine, the reduction of data happens > after a ordering shuffle. > > You¹re doing range joins in a SQL engine without theta joins and MapReduce > had no way to implement those at runtime (Tez has, with EdgeManager > plugins). > > The easiest/traditional approach out of doing geo-IP lookups is a compact > UDF model without any joins at all. > > There¹s some old threads on discussing this as a built-in & some code > (with potential licensing issues) - > http://markmail.org/message/w54j4upwg2wbh3xg > > Cheers, > Gopal > >
RE: Limiting outer join
Thanks for your replies. I see how extracting the first country would work, however I was hoping to speed up my query by stopping the search once a country has been found. Are you suggesting that I pass the whole IP table to a UDF and perform the search myself? I've only programmed simple UDFs so far (ex: reformat a string), so any additional details would be appreciated. I am mostly concerned about importing said table (currently stored in Hive) and distributing the task across nodes (note that I use Tez). Regards, B > Date: Mon, 6 Jul 2015 18:18:44 -0700 > Subject: Re: Limiting outer join > From: gop...@apache.org > To: user@hive.apache.org > > > > In the following query, it is possible to limit the amount of entries > >returned by an outer join to a single value? I want to obtain a single > >country from ipv4geotable for each entry in logontable. > > Yes, the PTF DENSE_RANK()/ROW_NUMBER() basically gives you that - you can > read the first row out of each logon.IP except, there¹s no way to force > which country wins over the other without an order by country in the > OVER() clause as well. > > That said, it will only get slower to produce 1 row per group, because of > the distributed nature of the SQL engine, the reduction of data happens > after a ordering shuffle. > > You¹re doing range joins in a SQL engine without theta joins and MapReduce > had no way to implement those at runtime (Tez has, with EdgeManager > plugins). > > The easiest/traditional approach out of doing geo-IP lookups is a compact > UDF model without any joins at all. > > There¹s some old threads on discussing this as a built-in & some code > (with potential licensing issues) - > http://markmail.org/message/w54j4upwg2wbh3xg > > Cheers, > Gopal > >
Limiting outer join
Hi, In the following query, it is possible to limit the amount of entries returned by an outer join to a single value? I want to obtain a single country from ipv4geotable for each entry in logontable. CREATE TABLE ipv4table AS SELECT logon.IP, ipv4.Country FROM (SELECT * FROM logontable WHERE isIpv4(IP)) logon LEFT OUTER JOIN (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON isIpv4(logon.IP) WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp; For instance, if I had the IP "W.X.Y.Z" in logontable, and that "W.X.Y.Z" fell in the range of both Italy and Spain in ipv4geotable, then I would like to associate it with Italy only. I've tried adding "LIMIT 1" to the second subquery :(SELECT StartIp, EndIp, Country FROM ipv4geotable LIMIT 1) ipv4 ON isIpv4(logon.IP)but this is wrong since the WHERE clause has to traverse all IPs. Limiting the where clause doesn't help either. Any ideas? Thank you! B
RE: Hive indexing optimization
Thank you, I will do that. B Subject: Re: Hive indexing optimization From: jpullokka...@hortonworks.com To: user@hive.apache.org Date: Tue, 30 Jun 2015 18:46:50 + Index doesn’t seems to be kicking in this case. Please file a bug for this. Thanks John From: Bennie Leo Reply-To: "user@hive.apache.org" Date: Monday, June 29, 2015 at 5:25 PM To: "user@hive.apache.org" Subject: RE: Hive indexing optimization I've attached the output. Thanks. B Subject: Re: Hive indexing optimization From: jpullokka...@hortonworks.com To: user@hive.apache.org Date: Mon, 29 Jun 2015 19:17:44 + Could you post explain extended output? From: Bennie Leo Reply-To: "user@hive.apache.org" Date: Monday, June 29, 2015 at 10:35 AM To: "user@hive.apache.org" Subject: RE: Hive indexing optimization Here is the explain output: STAGE PLANS: Stage: Stage-1 Tez Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) Vertices: Map 1 Map Operator Tree: TableScan alias: logontable filterExpr: isipv4(ip) (type: boolean) Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: isipv4(ip) (type: boolean) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Select Operator expressions: ip (type: bigint) outputColumnNames: _col0 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE value expressions: _col0 (type: bigint) Map 3 Map Operator Tree: TableScan alias: ipv4geotable Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: startip (type: bigint), endip (type: bigint), country (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: string) Reducer 2 Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} {VALUE._col1} {VALUE._col2} filter predicates: 0 {isipv4(VALUE._col0)} 1 outputColumnNames: _col0, _col1, _col2, _col3 Statistics: Num rows: 43281312 Data size: 5020632576 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: boolean) Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: bigint), _col3 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Thank you, B > Subject: Re: Hive indexing optimization > From: jpullokka...@hortonworks.com > To: user@hive.apache.org > CC: tben...@hotmail.com > Date: Sat, 27 Jun 2015 16:02:08 + > > "SELECT StartIp, EndIp, Country FROM ipv4geotable” should have been > rewritten as a scan against index table. > > BitMap Indexes seems to support inequalities (<=, <, >=). > > Post the explain plan. > &g
RE: Hive indexing optimization
I've attached the output. Thanks. B Subject: Re: Hive indexing optimization From: jpullokka...@hortonworks.com To: user@hive.apache.org Date: Mon, 29 Jun 2015 19:17:44 + Could you post explain extended output? From: Bennie Leo Reply-To: "user@hive.apache.org" Date: Monday, June 29, 2015 at 10:35 AM To: "user@hive.apache.org" Subject: RE: Hive indexing optimization Here is the explain output: STAGE PLANS: Stage: Stage-1 Tez Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) Vertices: Map 1 Map Operator Tree: TableScan alias: logontable filterExpr: isipv4(ip) (type: boolean) Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: isipv4(ip) (type: boolean) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Select Operator expressions: ip (type: bigint) outputColumnNames: _col0 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE value expressions: _col0 (type: bigint) Map 3 Map Operator Tree: TableScan alias: ipv4geotable Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: startip (type: bigint), endip (type: bigint), country (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: string) Reducer 2 Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} {VALUE._col1} {VALUE._col2} filter predicates: 0 {isipv4(VALUE._col0)} 1 outputColumnNames: _col0, _col1, _col2, _col3 Statistics: Num rows: 43281312 Data size: 5020632576 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: boolean) Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: bigint), _col3 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Thank you, B > Subject: Re: Hive indexing optimization > From: jpullokka...@hortonworks.com > To: user@hive.apache.org > CC: tben...@hotmail.com > Date: Sat, 27 Jun 2015 16:02:08 + > > "SELECT StartIp, EndIp, Country FROM ipv4geotable” should have been > rewritten as a scan against index table. > > BitMap Indexes seems to support inequalities (<=, <, >=). > > Post the explain plan. > > On 6/26/15, 8:56 PM, "Gopal Vijayaraghavan" wrote: > > >Hi, > > > >Hive indexes won¹t really help you speed up that query right now, because > >of the plan it generates due to the <= clauses. > > > >> CREATETABLE ipv4table > >> AS > >> SELECT logon.IP, ipv4.Country > >> FROM > >> (SELECT * FROM logontable WHERE isIpv4(IP)) logon > >> LEFT OUTER JOI
RE: Hive indexing optimization
Here is the explain output: STAGE PLANS: Stage: Stage-1 Tez Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE) Vertices: Map 1 Map Operator Tree: TableScan alias: logontable filterExpr: isipv4(ip) (type: boolean) Statistics: Num rows: 0 Data size: 550 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: isipv4(ip) (type: boolean) Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Select Operator expressions: ip (type: bigint) outputColumnNames: _col0 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE value expressions: _col0 (type: bigint) Map 3 Map Operator Tree: TableScan alias: ipv4geotable Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: startip (type: bigint), endip (type: bigint), country (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 41641243 Data size: 5144651200 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: string) Reducer 2 Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} {VALUE._col1} {VALUE._col2} filter predicates: 0 {isipv4(VALUE._col0)} 1 outputColumnNames: _col0, _col1, _col2, _col3 Statistics: Num rows: 43281312 Data size: 5020632576 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((_col1 <= _col0) and (_col0 <= _col2)) (type: boolean) Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: bigint), _col3 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 5209034 Data size: 497855986 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Thank you, B > Subject: Re: Hive indexing optimization > From: jpullokka...@hortonworks.com > To: user@hive.apache.org > CC: tben...@hotmail.com > Date: Sat, 27 Jun 2015 16:02:08 + > > "SELECT StartIp, EndIp, Country FROM ipv4geotable” should have been > rewritten as a scan against index table. > > BitMap Indexes seems to support inequalities (<=, <, >=). > > Post the explain plan. > > On 6/26/15, 8:56 PM, "Gopal Vijayaraghavan" wrote: > > >Hi, > > > >Hive indexes won¹t really help you speed up that query right now, because > >of the plan it generates due to the <= clauses. > > > >> CREATETABLE ipv4table > >> AS > >> SELECT logon.IP, ipv4.Country > >> FROM > >> (SELECT * FROM logontable WHERE isIpv4(IP)) logon > >> LEFT OUTER JOIN > >> (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON > >> isIpv4(logon.IP) > >> WHERE ipv4.StartIp <=logon.IP AND logon.IP <= ipv4.EndIp; > > > >That¹s a cross-product join, which can¹t be distributed at all & will take > >forever, even if you use Tez/LLAP. > > > >Range join queries have partial distribution rewrites, but AFAIK none of > >them use filter indexes in hive. > > > >But before I suggest a rewrite, can you post the ³explain ;² for > >the above query, so that I can check if this is indeed producing a > >cross-product + filter? > > > >Cheers, > >Gopal > > > > > > > > > > > > > > > > >
RE: Hive indexing optimization
Thank you for your response. Unfortunately, my query still takes the same amount of time to complete with this new setting. Would I have to do something like: "INSERT OVERWRITE DIRECTORY '/tmp/indexes/x' SELECT `_bucketname`, `_offsets` FROM default__t_x__ where j='and'; (The name default__t_x__ can be found in the output of step 2. Also, /tmp/indexes directory needs to exist in HDFS. You can substitute this to be any pre-existing directory in HDFS) SET hive.index.compact.file=/tmp/indexes/x; SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;" ? I don't know how I could include this within my current query. Cheers, B Subject: Re: Hive indexing optimization From: jpullokka...@hortonworks.com To: user@hive.apache.org Date: Fri, 26 Jun 2015 01:27:21 + Set hive.optimize.index.filter=true; Thanks John From: Bennie Leo Reply-To: "user@hive.apache.org" Date: Thursday, June 25, 2015 at 5:48 PM To: "user@hive.apache.org" Subject: Hive indexing optimization Hi, I am attempting to optimize a query using indexing. My current query converts an ipv4 address to a country using a geolocation table. However, the geolocation table is fairly large and the query takes an impractical amount of time. I have created indexes and set the binary search parameter to true (default), but the query is not faster. Here is how I set up indexing: DROPINDEXIFEXISTS ipv4indexes ON ipv4geotable; CREATEINDEX ipv4indexes ONTABLE ipv4geotable (StartIp, EndIp) AS'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITHDEFERREDREBUILD IDXPROPERTIES ('hive.index.compact.binary.search'='true'); ALTERINDEX ipv4indexes ON ipv4geotable REBUILD; And here is my query: DROPTABLEIFEXISTS ipv4table; CREATETABLE ipv4table AS SELECT logon.IP, ipv4.Country FROM (SELECT * FROM logontable WHERE isIpv4(IP)) logon LEFTOUTERJOIN (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON isIpv4(logon.IP) WHERE ipv4.StartIp <= logon.IPANDlogon.IP <= ipv4.EndIp; What the query is doing is extracting an IP from logontable and finding in which range it lies within the geolocation table (which is sorted). When a range is found, the corresponding country is returned. The problem is that Hive goes through the whole table row by row rather than performing a smart search (ex: binary search). Any suggestions on how to speed things up? Thank you, B
Hive indexing optimization
Hi, I am attempting to optimize a query using indexing. My current query converts an ipv4 address to a country using a geolocation table. However, the geolocation table is fairly large and the query takes an impractical amount of time. I have created indexes and set the binary search parameter to true (default), but the query is not faster. Here is how I set up indexing: DROP INDEX IF EXISTS ipv4indexes ON ipv4geotable; CREATE INDEX ipv4indexes ON TABLE ipv4geotable (StartIp, EndIp) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD IDXPROPERTIES ('hive.index.compact.binary.search'='true'); ALTER INDEX ipv4indexes ON ipv4geotable REBUILD; And here is my query: DROP TABLE IF EXISTS ipv4table; CREATE TABLE ipv4table AS SELECT logon.IP, ipv4.Country FROM (SELECT * FROM logontable WHERE isIpv4(IP)) logon LEFT OUTER JOIN (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON isIpv4(logon.IP) WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp; What the query is doing is extracting an IP from logontable and finding in which range it lies within the geolocation table (which is sorted). When a range is found, the corresponding country is returned. The problem is that Hive goes through the whole table row by row rather than performing a smart search (ex: binary search). Any suggestions on how to speed things up? Thank you, B