Re: [VOTE] Bylaws change to allow some commits without review
+1

On Wed, Apr 20, 2016 at 1:24 AM, Jimmy Xiang wrote:
> +1
>
> On Tue, Apr 19, 2016 at 2:58 PM, Alpesh Patel wrote:
>> +1
>>
>> On Tue, Apr 19, 2016 at 1:29 PM, Lars Francke wrote:
>>> Thanks everyone! Vote runs for at least one more day. I'd appreciate it if
>>> you could ping/bump your colleagues to chime in here.
>>>
>>> I'm not entirely sure how many PMC members are active and how many votes
>>> we need but I think a few more are probably needed.
>>>
>>> On Mon, Apr 18, 2016 at 8:02 PM, Thejas Nair wrote:
>>>> +1
>>>>
>>>> From: Wei Zheng
>>>> Sent: Monday, April 18, 2016 10:51 AM
>>>> To: user@hive.apache.org
>>>> Subject: Re: [VOTE] Bylaws change to allow some commits without review
>>>>
>>>> +1
>>>>
>>>> Thanks,
>>>> Wei
>>>>
>>>> From: Siddharth Seth
>>>> Reply-To: "user@hive.apache.org"
>>>> Date: Monday, April 18, 2016 at 10:29
>>>> To: "user@hive.apache.org"
>>>> Subject: Re: [VOTE] Bylaws change to allow some commits without review
>>>>
>>>> +1
>>>>
>>>> On Wed, Apr 13, 2016 at 3:58 PM, Lars Francke wrote:
>>>>> Hi everyone,
>>>>>
>>>>> we had a discussion on the dev@ list about allowing some forms of
>>>>> contributions to be committed without a review.
>>>>>
>>>>> The exact sentence I propose to add is: "Minor issues (e.g. typos, code
>>>>> style issues, JavaDoc changes. At committer's discretion) can be committed
>>>>> after soliciting feedback/review on the mailing list and not receiving
>>>>> feedback within 2 days."
>>>>>
>>>>> The proposed bylaws can also be seen here:
>>>>> https://cwiki.apache.org/confluence/display/Hive/Proposed+Changes+to+Hive+Project+Bylaws+-+April+2016
>>>>>
>>>>> This vote requires a 2/3 majority of all Active PMC members so I'd love
>>>>> to get as many votes as possible. The vote will run for at least six days.
>>>>>
>>>>> Thanks,
>>>>> Lars
Writing HFiles using Hive for an HBase bulk load - possible bug?
Hi folks,

I am trying to create HFiles from a Hive table to bulk load into HBase and am following the HWX tutorial [1]. It creates the HFiles correctly but then fails when closing the RecordWriter with the following stack trace:

Error: java.lang.RuntimeException: Hive Runtime Error while closing operators: java.io.IOException: Multiple family directories found in hdfs://c1n1.gbif.org:8020/user/hive/warehouse/tim.db/coords_hbase/_temporary/2/_temporary
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:295)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

The reason is that the HFiles are created in the task attempt folder but it is looking 2 directories above the task attempt for the HFile.

Does anyone else see this behaviour please? I logged it as a bug [2], which also details exactly my procedure, but I wonder if someone could confirm that they also see this, or if perhaps I am just doing something wrong and it works for them?

Thanks all,
Tim

[1] https://community.hortonworks.com/articles/2745/creating-hbase-hfiles-from-an-existing-hive-table.html
[2] https://issues.apache.org/jira/browse/HIVE-13539
Re: UDF error?
Here is an example of a no-arg UDF that will return a different value for each row:
https://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/java/org/gbif/occurrencestore/hive/udf/UuidUDF.java

Hope this helps,
Tim

On Mon, Sep 30, 2013 at 10:59 PM, Yang wrote:
> thanks! at first I did have a no-arg evaluate(), but somehow
>
>   select myfunction(), field1, field2 from mytable;
>
> spits out the same value for myfunction() for each row. so I was wondering
> whether the UDF got called only 1 time, because the hive compiler sees that
> the argument is void, so that all the invocations would be "having the same
> value", then I tried to pass in a param to prevent this possibility.
>
> On Mon, Sep 30, 2013 at 1:55 PM, Tim Robertson wrote:
>> It's been ages since I wrote one, but the differences to mine:
>>
>> a) I use LongWritable:
>>      public LongWritable evaluate(LongWritable startAt) {
>>
>> b) I have annotations on the class (but I think they are just for docs):
>>      @Description(name = "row_sequence",
>>          value = "_FUNC_() - Returns a generated row sequence number starting from 1")
>>      @UDFType(deterministic = false)
>>      public class UDFRowSequence extends UDF {
>>
>> Hope this helps!
>> Tim
>>
>> On Mon, Sep 30, 2013 at 10:47 PM, Yang wrote:
>>> I wrote a super simple UDF, but got some errors:
>>>
>>> UDF:
>>>
>>> package yy;
>>> import org.apache.hadoop.hive.ql.exec.UDF;
>>> import java.util.Random;
>>> import java.util.UUID;
>>> import java.lang.management.*;
>>>
>>> public class MyUdf extends UDF {
>>>     static Random rand = new Random(System.currentTimeMillis() +
>>>         Thread.currentThread().getId() * 100);
>>>     String name = ManagementFactory.getRuntimeMXBean().getName();
>>>     long startValue = Long.valueOf(name.replaceAll("[^\\d]+", "")) *
>>>         1 + Thread.currentThread().getId() * 1000;
>>>     public long evaluate(long x) {
>>>         //return (long) UUID.randomUUID().hashCode();
>>>         //return rand.nextLong();
>>>         return startValue++;
>>>     }
>>> }
>>>
>>> sql script:
>>>
>>> CREATE TEMPORARY FUNCTION gen_uniq2 AS 'yy.MyUdf';
>>> select gen_uniq2(field1), field2
>>> from yy_mapping limit 10;
>>>
>>> field1 is bigint, field2 is int
>>>
>>> error:
>>>
>>> hive> source aa.sql;
>>> Added ./MyUdf.jar to class path
>>> Added resource: ./MyUdf.jar
>>> OK
>>> Time taken: 0.0070 seconds
>>> FAILED: SemanticException [Error 10014]: Line 2:7 Wrong arguments
>>> 'field1': No matching method for class yy.MyUdf with (bigint). Possible
>>> choices: _FUNC_()
>>>
>>> so I'm declaring a UDF with arg of long, so that should work for a
>>> bigint (more importantly it's complaining not long vs bigint, but bigint
>>> vs void). I tried changing both to int, same failure
>>>
>>> thanks!
>>> yang
Re: UDF error?
That class is:
https://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/java/org/gbif/occurrencestore/hive/udf/UDFRowSequence.java

Cheers,
Tim

On Mon, Sep 30, 2013 at 10:55 PM, Tim Robertson wrote:
> It's been ages since I wrote one, but the differences to mine:
>
> a) I use LongWritable:
>      public LongWritable evaluate(LongWritable startAt) {
>
> b) I have annotations on the class (but I think they are just for docs):
>      @Description(name = "row_sequence",
>          value = "_FUNC_() - Returns a generated row sequence number starting from 1")
>      @UDFType(deterministic = false)
>      public class UDFRowSequence extends UDF {
>
> Hope this helps!
> Tim
>
> On Mon, Sep 30, 2013 at 10:47 PM, Yang wrote:
>> I wrote a super simple UDF, but got some errors:
>>
>> UDF:
>>
>> package yy;
>> import org.apache.hadoop.hive.ql.exec.UDF;
>> import java.util.Random;
>> import java.util.UUID;
>> import java.lang.management.*;
>>
>> public class MyUdf extends UDF {
>>     static Random rand = new Random(System.currentTimeMillis() +
>>         Thread.currentThread().getId() * 100);
>>     String name = ManagementFactory.getRuntimeMXBean().getName();
>>     long startValue = Long.valueOf(name.replaceAll("[^\\d]+", "")) *
>>         1 + Thread.currentThread().getId() * 1000;
>>     public long evaluate(long x) {
>>         //return (long) UUID.randomUUID().hashCode();
>>         //return rand.nextLong();
>>         return startValue++;
>>     }
>> }
>>
>> sql script:
>>
>> CREATE TEMPORARY FUNCTION gen_uniq2 AS 'yy.MyUdf';
>> select gen_uniq2(field1), field2
>> from yy_mapping limit 10;
>>
>> field1 is bigint, field2 is int
>>
>> error:
>>
>> hive> source aa.sql;
>> Added ./MyUdf.jar to class path
>> Added resource: ./MyUdf.jar
>> OK
>> Time taken: 0.0070 seconds
>> FAILED: SemanticException [Error 10014]: Line 2:7 Wrong arguments
>> 'field1': No matching method for class yy.MyUdf with (bigint). Possible
>> choices: _FUNC_()
>>
>> so I'm declaring a UDF with arg of long, so that should work for a bigint
>> (more importantly it's complaining not long vs bigint, but bigint vs void).
>> I tried changing both to int, same failure
>>
>> thanks!
>> yang
Re: UDF error?
It's been ages since I wrote one, but the differences to mine:

a) I use LongWritable:
     public LongWritable evaluate(LongWritable startAt) {

b) I have annotations on the class (but I think they are just for docs):
     @Description(name = "row_sequence",
         value = "_FUNC_() - Returns a generated row sequence number starting from 1")
     @UDFType(deterministic = false)
     public class UDFRowSequence extends UDF {

Hope this helps!
Tim

On Mon, Sep 30, 2013 at 10:47 PM, Yang wrote:
> I wrote a super simple UDF, but got some errors:
>
> UDF:
>
> package yy;
> import org.apache.hadoop.hive.ql.exec.UDF;
> import java.util.Random;
> import java.util.UUID;
> import java.lang.management.*;
>
> public class MyUdf extends UDF {
>     static Random rand = new Random(System.currentTimeMillis() +
>         Thread.currentThread().getId() * 100);
>     String name = ManagementFactory.getRuntimeMXBean().getName();
>     long startValue = Long.valueOf(name.replaceAll("[^\\d]+", "")) *
>         1 + Thread.currentThread().getId() * 1000;
>     public long evaluate(long x) {
>         //return (long) UUID.randomUUID().hashCode();
>         //return rand.nextLong();
>         return startValue++;
>     }
> }
>
> sql script:
>
> CREATE TEMPORARY FUNCTION gen_uniq2 AS 'yy.MyUdf';
> select gen_uniq2(field1), field2
> from yy_mapping limit 10;
>
> field1 is bigint, field2 is int
>
> error:
>
> hive> source aa.sql;
> Added ./MyUdf.jar to class path
> Added resource: ./MyUdf.jar
> OK
> Time taken: 0.0070 seconds
> FAILED: SemanticException [Error 10014]: Line 2:7 Wrong arguments
> 'field1': No matching method for class yy.MyUdf with (bigint). Possible
> choices: _FUNC_()
>
> so I'm declaring a UDF with arg of long, so that should work for a bigint
> (more importantly it's complaining not long vs bigint, but bigint vs void).
> I tried changing both to int, same failure
>
> thanks!
> yang
Re: Questions about Hive
I don't think Hive is intended for web request scoped operations... that would be a rather unusual case from my understanding. HBase sounds more like the Hadoop equivalent that you might be looking for, but you need to look at your search patterns to see if HBase is a good fit (you need to manage your own indexes again).

Cheers,
Tim

On Mon, Sep 17, 2012 at 8:07 AM, Something Something <mailinglist...@gmail.com> wrote:
> Thank you both for the answers. We are trying to find out if Hive can be
> used as a replacement of Netezza, but if there are no indexes then I don't
> see how it will beat Netezza in terms of performance. Sounds like it
> certainly can't be used to do a quick lookup from a webapp - like Netezza can.
>
> If performance isn't a concern, then I guess it could be a useful tool.
> Will try it out & see how it works out. Thanks.
>
> On Sun, Sep 16, 2012 at 10:51 PM, Tim Robertson wrote:
>>> Note: I am a newbie to Hive.
>>>
>>> Can someone please answer the following questions?
>>>
>>> 1) Does Hive provide APIs (like HBase does) that can be used to
>>> retrieve data from the tables in Hive from a Java program? I heard
>>> somewhere that the data can be accessed with JDBC (style) APIs. True?
>>
>> True.
>> https://cwiki.apache.org/Hive/hiveclient.html#HiveClient-JDBC
>>
>>> 2) I don't see how I can add indexes on the tables, so does that mean a
>>> query such as the following will trigger a MR job that will search files on
>>> HDFS sequentially?
>>>
>>> hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
>>
>> There are some index implementations in hive, but it is not as simple as
>> a traditional db. E.g. search Jira and see some of the work:
>> https://issues.apache.org/jira/browse/HIVE-417
>>
>> You are correct that the above would do a full table scan.
>>
>>> 3) Has anyone compared performance of Hive against other NOSQL databases
>>> such as HBase, MongoDB. I understand it's not exactly apples to apples
>>> comparison, but still...
>>
>> I think you misunderstand what Hive is. It is basically a SQL to MR
>> translation engine, which has adapters for the input source. By default it
>> uses simple files on the HDFS, but there are (e.g.) HBase adapters, so you
>> can use it to run SQL on HBase tables (which works great). Regarding
>> performance, on the HBase scans, the operation is the same as running a
>> normal HBase MR scan, so is the same.
>>
>>> Thanks.
Re: Questions about Hive
> Note: I am a newbie to Hive.
>
> Can someone please answer the following questions?
>
> 1) Does Hive provide APIs (like HBase does) that can be used to retrieve
> data from the tables in Hive from a Java program? I heard somewhere that
> the data can be accessed with JDBC (style) APIs. True?

True.
https://cwiki.apache.org/Hive/hiveclient.html#HiveClient-JDBC

> 2) I don't see how I can add indexes on the tables, so does that mean a
> query such as the following will trigger a MR job that will search files on
> HDFS sequentially?
>
> hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

There are some index implementations in hive, but it is not as simple as a traditional db. E.g. search Jira and see some of the work:
https://issues.apache.org/jira/browse/HIVE-417

You are correct that the above would do a full table scan.

> 3) Has anyone compared performance of Hive against other NOSQL databases
> such as HBase, MongoDB. I understand it's not exactly apples to apples
> comparison, but still...

I think you misunderstand what Hive is. It is basically a SQL to MR translation engine, which has adapters for the input source. By default it uses simple files on the HDFS, but there are (e.g.) HBase adapters, so you can use it to run SQL on HBase tables (which works great). Regarding performance, on the HBase scans, the operation is the same as running a normal HBase MR scan, so is the same.

> Thanks.
Data-local map tasks - reported correctly in JT?
Hi all,

I have a 6 node cluster, and on a simple query created with a table from a CSV, I was seeing a lot of mappers reporting that they were not using data locality. I changed the replication factor to 6 but still MR is showing only about 60% data locality in the data-local map tasks.

How can this be when I have no under-replicated blocks, and the replication count is the same as the machine count? Am I missing something? Does it indicate that something is wrong in the MR configuration (e.g. a TT not recognizing the local DN)?

The 6 machines each have 12 spindles in them and I'm running Hive 0.7 and 0.9 trunk built about 2 weeks ago.

Many thanks!
Tim
Re: Hive can't find some tables
It sounds like you have run Sqoop without specifying a durable metastore for Hive, e.g. you haven't told Hive to use MySQL, Postgres etc. to store its metadata. It probably used Derby DB, which either put it all in memory or put it all in the /tmp directory, which was destroyed on restart. I would imagine you need to sqoop in again, after you correct this.

HTH,
Tim

On Wed, Apr 18, 2012 at 5:29 AM, Đỗ Hoàng Khiêm wrote:
> Hi, I have some problems with Hive; it looks like Hive cannot read some of my
> tables which were imported before by Sqoop. After importing from Sqoop it's
> still able to read/query from these tables, but after I restarted the machine,
> Hive cannot find these tables anymore (as *show tables* didn't list these
> tables), while it's still able to create new tables. I have checked in HDFS;
> these table files still existed at /user/hive/warehouse/.
>
> When I start the hive shell and then try to show tables, the hive log prints out:
>
> 2012-04-18 10:25:11,412 WARN conf.HiveConf (HiveConf.java:(68)) -
>     hive-site.xml not found on CLASSPATH
> 2012-04-18 10:25:43,239 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) -
>     Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.
> 2012-04-18 10:25:43,239 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) -
>     Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.resources" but it cannot be resolved.
> 2012-04-18 10:25:43,240 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) -
>     Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.
> 2012-04-18 10:25:43,240 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) -
>     Bundle "org.eclipse.jdt.core" requires "org.eclipse.core.runtime" but it cannot be resolved.
> 2012-04-18 10:25:43,240 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) -
>     Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.
> 2012-04-18 10:25:43,240 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) -
>     Bundle "org.eclipse.jdt.core" requires "org.eclipse.text" but it cannot be resolved.
>
> How can I recover these tables?
>
> Thanks.
Re: Can we define external table Fields enclosed in "
Hi again,

How about defining a table (t1) with ~ as the delimiter and then creating a view on that table which uses the regexp_replace UDF?

CREATE VIEW v_user_access AS
SELECT regexp_replace(ip, "\"", "") as ip,
       ...
FROM t1;

Not sure of the implications on joining, but basic queries should work ok I would think.

HTH,
Tim

On Wed, Apr 18, 2012 at 9:20 PM, Gopi Kodumur wrote:
> Thanks Tim, sorry for not explaining the problem clearly...
>
> I have data in this format, and I wanted to store the data in the external
> Hive table without the double quotes:
>
> "127.0.0.17"~"444c1c9a-8820-11e1-aaa8-00219b8a879e"~"2012-04-17T00:00:01Z"~"476825ea-8820-11e1-a105-0200ac1d1c3d
> "127.0.0.12"~"544c1c9a-8820-11e1-aaa8-00219b8a879e"~"2012-04-17T00:00:01Z"~"476825ea-8820-11e1-a105-0200ac1d1c3d
> "127.0.0.13"~"644c1c9a-8820-11e1-aaa8-00219b8a879e"~"2012-04-17T00:00:01Z"~"476825ea-8820-11e1-a105-0200ac1d1c3d
>
> *From:* Tim Robertson
> *To:* user@hive.apache.org; Gopi Kodumur
> *Sent:* Wednesday, April 18, 2012 12:14 PM
> *Subject:* Re: Can we define external table Fields enclosed in "
>
> I believe so. From the tutorial [1]:
>
> CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
>     page_url STRING, referrer_url STRING,
>     ip STRING COMMENT 'IP Address of the User',
>     country STRING COMMENT 'country of origination')
> COMMENT 'This is the staging page view table'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
>
> HTH,
> Tim
>
> [1] https://cwiki.apache.org/confluence/display/Hive/Tutorial
>
> On Tue, Apr 17, 2012 at 11:20 PM, Gopi Kodumur wrote:
>> Is it possible to specify enclosed by character " for fields, while
>> defining external table
>>
>> Thanks
>> Gopi
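For readers finding this thread later, the approach above can be sketched end-to-end. This is a hedged sketch only: the table and column names are illustrative, not from the original posts, and assume the three '~'-delimited sample fields shown.

```sql
-- Stage the '~'-delimited file as-is; the surrounding quotes stay in the data.
CREATE EXTERNAL TABLE user_access_raw (
  ip          STRING,
  session_id  STRING,
  access_time STRING,
  request_id  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '~'
LOCATION '/data/user_access';

-- Strip the double quotes at query time with regexp_replace.
CREATE VIEW v_user_access AS
SELECT
  regexp_replace(ip, '"', '')          AS ip,
  regexp_replace(session_id, '"', '')  AS session_id,
  regexp_replace(access_time, '"', '') AS access_time,
  regexp_replace(request_id, '"', '')  AS request_id
FROM user_access_raw;
```

Queries then go against the view, so the raw staging table never needs rewriting.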
Re: Can we define external table Fields enclosed in "
I believe so. From the tutorial [1]:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User',
    country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'

HTH,
Tim

[1] https://cwiki.apache.org/confluence/display/Hive/Tutorial

On Tue, Apr 17, 2012 at 11:20 PM, Gopi Kodumur wrote:
> Is it possible to specify enclosed by character " for fields, while
> defining external table
>
> Thanks
> Gopi
Re: Help with INTEGER on HBase (Hive-1634)
Apologies, it does indeed work when you add the correct JARs in Hive.

Tim

On Tue, Apr 17, 2012 at 3:33 PM, Tim Robertson wrote:
> Hi all,
>
> I am *really* interested in HIVE-1634
> (https://issues.apache.org/jira/browse/HIVE-1634). I have just built from
> Hive trunk using HBase 0.90.4 as the version (e.g. we run cdh3u2).
>
> We have an HBase table populated with Bytes, so I create the Hive table
> like so:
>
> CREATE EXTERNAL TABLE tim_hbase_occurrence (
>   id int,
>   scientific_name string,
>   data_resource_id int
> ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES (
>   "hbase.columns.mapping" = ":key,v:scientific_name,v:data_resource_id",
>   "hbase.columns.storage.types" = "b,b,b"
> ) TBLPROPERTIES (
>   "hbase.table.name" = "mini_occurrences",
>   "hbase.table.default.storage.type" = "binary"
> );
>
> This suggests it understands the formats:
>
> hive> SELECT * FROM tim_hbase_occurrence LIMIT 3;
> OK
> 1444    Abies alba    1081
> 1445    Abies alba    1081
> 1446    Abies alba    1081
>
> But doing any queries suggests not:
>
> hive> SELECT * FROM tim_hbase_occurrence WHERE scientific_name='Abies alba' limit 3;
> ...
> NULL    Abies alba    NULL
> NULL    Abies alba    NULL
> NULL    Abies alba    NULL
> Time taken: 9.668 seconds
>
> hive> SELECT * FROM tim_hbase_occurrence WHERE data_resource_id=1081;
> ...
> 0 (no records)
>
> Can anyone provide any guidance on this please?
>
> Thanks!
> Tim
Help with INTEGER on HBase (Hive-1634)
Hi all,

I am *really* interested in HIVE-1634 (https://issues.apache.org/jira/browse/HIVE-1634). I have just built from Hive trunk using HBase 0.90.4 as the version (e.g. we run cdh3u2).

We have an HBase table populated with Bytes, so I create the Hive table like so:

CREATE EXTERNAL TABLE tim_hbase_occurrence (
  id int,
  scientific_name string,
  data_resource_id int
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,v:scientific_name,v:data_resource_id",
  "hbase.columns.storage.types" = "b,b,b"
) TBLPROPERTIES (
  "hbase.table.name" = "mini_occurrences",
  "hbase.table.default.storage.type" = "binary"
);

This suggests it understands the formats:

hive> SELECT * FROM tim_hbase_occurrence LIMIT 3;
OK
1444    Abies alba    1081
1445    Abies alba    1081
1446    Abies alba    1081

But doing any queries suggests not:

hive> SELECT * FROM tim_hbase_occurrence WHERE scientific_name='Abies alba' limit 3;
...
NULL    Abies alba    NULL
NULL    Abies alba    NULL
NULL    Abies alba    NULL
Time taken: 9.668 seconds

hive> SELECT * FROM tim_hbase_occurrence WHERE data_resource_id=1081;
...
0 (no records)

Can anyone provide any guidance on this please?

Thanks!
Tim
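For readers finding this thread later: in Hive builds where HIVE-1634 has landed, the storage type can also be declared per column with a "#" suffix inside hbase.columns.mapping, instead of the table-wide properties used above. A hedged sketch of the same table with that syntax (assuming the same HBase table and a build that supports the suffixes):

```sql
CREATE EXTERNAL TABLE tim_hbase_occurrence (
  id int,
  scientific_name string,
  data_resource_id int
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  -- "#b" marks a column as binary-encoded, "#s" as string-encoded
  "hbase.columns.mapping" = ":key#b,v:scientific_name#s,v:data_resource_id#b"
) TBLPROPERTIES (
  "hbase.table.name" = "mini_occurrences"
);
```

As the follow-up in this thread notes, the symptom above (values decode on SELECT * but predicates return NULL/no rows) was resolved by putting the correct freshly built JARs on Hive's classpath.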
Re: Case Studies for 'Programming Hive' book from O'Reilly
Hi Jason,

I work for an international organization involved in the mobilization of biodiversity data (specifically we are dealing a lot with observations of species), so think of it as a lot of point-based information with metadata tags. We have built an Oozie workflow that uses Sqoop to suck in a few databases and then does a big transformation and set of quality controls, which we did using Hive and some custom UDFs. There is a blog introducing this on http://www.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/

All our work and data are open, so I can freely write about any of it, and can link to real production code in Google svn. If it would be of interest to you I am happy to discuss what would be most useful to help write up for your book. Some possible angles you might consider:

- real UDFs in action (e.g. parsing species scientific names)
- UDTFs to generate a Google map tile cache
- Hive in an ETL workflow to remove load from DBs
- the pros and cons of calling web services from a UDF (we do it, but it keeps concerns clean and we accept the risk of a DDoS we can control)
- Sqoop and Hive together
- we are getting into Hive on HBase and have found UDFs can help with type safety since we aren't running HIVE-1634 [with the advancements in Hive 0.9 I would think our workarounds are not worth documenting]
- metrics illustrating the importance of join order, and knowing data cardinality to ensure decent performance

Hope this is of interest,
Tim

On Wed, Apr 11, 2012 at 7:48 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> Dear Hive User,
>
> We want your interesting case study for our upcoming book titled
> 'Programming Hive' from O'Reilly.
>
> How you use Hive, either high level or low level code details, are both
> encouraged!
>
> Feel free to reach out with a brief abstract.
>
> Regards,
>
> Jason Rutherglen
A GIS contains() for Hive?
Hi all,

I need to perform a lot of "point in polygon" checks and want to use Hive (currently I mix Hive, Sqoop and PostGIS in an Oozie workflow to do this). In an ideal world, I would like to create a Hive table from a Shapefile containing polygons, and then do the likes of the following:

SELECT p.id, pp.id
FROM points p, polygons pp
WHERE pp.contains(geom, toPoint(p.lat, p.lng))

Has anyone done anything along these lines?

Alternatively I am capable of doing a UDF that would read the shape file into memory and basically do a map-side join using something like a slab decomposition technique. It is more limited but would meet my needs, allowing e.g.:

SELECT contains(p.lat, p.lng, '/data/shapefiles/countries.shp') FROM points;

Before I start I thought I'd ask, as I suspect people are doing this kind of thing on Hive by now (thinking FB and user profiling by political boundaries etc.). I'd love to hear from anyone who's investigated this or could provide any advice.

Thanks!
Tim
Copenhagen / Scandinavian HUG Meetup - any interest?
Hi all, (cross posted to a few Hadoop mailing lists - apologies for the SPAM) Are there any users around the Copenhagen area that would like a HUG meetup? Just reply with +1 and I'll gauge interest. We could probably host a 1/2 or full day if people were coming from Sweden... We are using Hadoop, Oozie, Hive, Pig, Sqoop in production, and getting into HBase now, and would like to chat with like minded folks. Cheers, Tim
Restoring configuration variables after issuing "set"
Hi all,

Can someone please tell me how to achieve the following in a single hive script?

set original_value = mapred.reduce.tasks;
set mapred.reduce.tasks=1;
... do stuff
set mapred.reduce.tasks=original_value;

It is the first and last lines that don't work - is it possible?

Thanks,
Tim
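For readers finding this thread later: set cannot read a property's current value into a variable, so the first and last lines of the script above won't work. One hedged workaround, in Hive versions with variable substitution, is to pass the restore value in from outside the script; the variable name here is illustrative:

```sql
-- invoked as: hive --hivevar original_reducers=-1 -f script.sql
-- (-1 is Hive's usual "let the engine decide" setting for mapred.reduce.tasks)
set mapred.reduce.tasks=1;
-- ... do stuff ...
set mapred.reduce.tasks=${hivevar:original_reducers};
```

This keeps the script self-contained apart from the one value supplied on the command line.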
UDFRowSequence called in Map() ?
Hi all,

I am using UDFRowSequence as follows:

CREATE TEMPORARY FUNCTION rowSequence AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
set mapred.reduce.tasks=1;
CREATE TABLE temp_tc1_test as
SELECT rowSequence() AS id, data_resource_id, local_id, local_parent_id, name, author
FROM normalized;

I see 2 jobs, the first of which runs with 2 map() and 0 reduce() on my small test data. I believe the rowSequence() is being called in the map and not the reduce, as the results have duplicate IDs:

select * from temp_tc1_test where id=8915;
8915    167    1148    1113    Cytospora elaeagni     Allesch.
8915    168    7       6       Achromadora inflata    Abebe & Coomans, 1996

Is there any way to enforce that the UDF is called in the reduce?

Thanks,
Tim
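For readers finding this thread later: one pattern that may force the numbering into a reduce stage is to nest the query so that rowSequence() is applied over the output of a subquery that itself requires a reduce. This is a hedged sketch only - the DISTRIBUTE BY column choice is illustrative, and it relies on mapred.reduce.tasks=1 so that all rows pass through a single reducer:

```sql
set mapred.reduce.tasks=1;

CREATE TABLE temp_tc1_test AS
SELECT rowSequence() AS id,
       data_resource_id, local_id, local_parent_id, name, author
FROM (
  SELECT data_resource_id, local_id, local_parent_id, name, author
  FROM normalized
  DISTRIBUTE BY data_resource_id  -- forces a reduce stage; with one
  SORT BY data_resource_id        -- reducer, the sequence is assigned once
) t;
```

With two map-only tasks, as in the original plan, each task keeps its own counter, which is exactly why duplicate IDs such as the two id=8915 rows appear.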
Opposite of explode?
Hi all,

Sorry if I am missing something obvious, but is there an inverse of an explode? E.g. given t1:

ID  Name
1   Tim
2   Tim
3   Tom
4   Frank
5   Tim

Can you create t2:

Name   ID
Tim    1,2,5
Tom    3
Frank  4

In Oracle it would be:

select name, collect(id) from t1 group by name

I suspect in Hive it is related to an Array but can't find the syntax.

Thanks for any pointers,
Tim
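For readers finding this thread later: Hive's built-in collect_set() UDAF does essentially this (note it removes duplicates and does not guarantee element order). A short sketch against the t1 above:

```sql
-- Array result, one row per name:
SELECT name, collect_set(id) AS ids
FROM t1
GROUP BY name;

-- Or the comma-separated string form shown in t2:
SELECT name, concat_ws(',', collect_set(cast(id AS string))) AS ids
FROM t1
GROUP BY name;
```

The cast to string is needed because concat_ws operates on strings.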
Re: Are there any examples of simple UDAFs?
What about the count or max? http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCount.java http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMax.java I've not used UDAFs, but I only got my UDTFs working by looking at the examples in the Hive SVN itself. HTH, Tim On Wed, Dec 8, 2010 at 3:27 PM, Leo Alekseyev wrote: > I am trying to write a very simple aggregation function which seems > like an overkill for using GenericUDAF as described on the wiki. > However, I can't get the code to run. It always throws an exception > of the form > > java.lang.ClassNotFoundException: > com.hadoopbook.hive.Maximum$MaximumIntUDAFEvaluator > > (As you can see, even the sample code from Tom White's Hadoop book -- > the only sample code I found -- doesn't work.) I'm using the standard > add jar and create temporary function statements... > > --Leo >
Re: How to generate global unique ID?
Does it need to be a sequential INT? If not, then a UUID works very well. Cheers, Tim On Tue, Nov 16, 2010 at 8:55 AM, afancy wrote: > Hi, Zhang, > How to integrate this snowflake with Hive? Thanks! > Regards, > afancy > > On Mon, Nov 15, 2010 at 10:35 AM, Jeff Zhang wrote: >> >> Please refer https://github.com/twitter/snowflake >> >> >> >> On Mon, Nov 15, 2010 at 5:09 PM, afancy wrote: >> > Hi, >> > Does anybody know how to generate a unique ID in Hive? I have a table: >> > hive> describe results; >> > OK >> > id int >> > test string >> > Time taken: 0.091 seconds >> > >> > When I insert data into the table results, i want to generate a unqiue >> > Id, >> > which is like a sequence in RDBMS. >> > Regards, >> > afancy >> >> >> >> -- >> Best Regards >> >> Jeff Zhang > >
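For readers finding this thread later: if a non-sequential ID is acceptable, one hedged option in Hive versions that ship the reflect() UDF is to call java.util.UUID directly, with no custom UDF at all. The table here is the "results" table from the question:

```sql
SELECT reflect('java.util.UUID', 'randomUUID') AS id,
       test
FROM results;
```

Because reflect() is evaluated per row, each row gets its own UUID; the trade-off versus a sequence is that the IDs carry no ordering.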
UDTF arguments limited to 10?
Is there a maximum limit of 10 fields in a UDTF? The following is always giving issues:

public void process(Object[] args) throws HiveException {
    ...
    ((LazyInteger) args[10]).getWritableObject().get();

I am trying to do:

create table density_cluster_ungrouped as
select taxonDensityUDTF(kingdom_concept_id, phylum_concept_id, class_concept_id,
    order_concept_id, family_concept_id, genus_concept_id, species_concept_id,
    nub_concept_id, latitude, longitude, count, 23)
  as (taxonId, tileX, tileY, zoom, clusterX, clusterY, count)
from density_occurrence_grouped;

Thanks,
Tim
Re: Only a single expression in the SELECT clause is supported with UDTF's
Thanks Paul, Confirmed that it does indeed work like that. On Tue, Nov 9, 2010 at 12:13 AM, Paul Yang wrote: > In your original query, I think if you put parenthesis around p,k it should > have worked: > > select taxonDensityUDTF(kingdom_concept_id, phylum_concept_id) as (p,k) ... > > -Original Message- > From: Tim Robertson [mailto:timrobertson...@gmail.com] > Sent: Monday, November 08, 2010 5:53 AM > To: user@hive.apache.org > Subject: Re: Only a single expression in the SELECT clause is supported with > UDTF's > > Thank you once again Sonal. > > The following worked just nicely: > select k,p from temp_kingdom_phylum lateral view > taxonDensityUDTF(kingdom_concept_id, phylum_concept_id) e as k,p; > > [For anyone stumbling across this thread: I also had to change the > close() and issue the forward() in the process andnot in the close() > which is against the example shipped with hive, but in accord with the > docs which say one must not do this] > > Cheers, > Tim > > > > On Mon, Nov 8, 2010 at 2:18 PM, Sonal Goyal wrote: >> Hi Tim, >> >> I guess you are running into limitations while using UDTFs. Check >> http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#UDTF. I think you >> should be able to use lateral view in your query. >> >> Thanks and Regards, >> Sonal >> >> Sonal Goyal | Founder and CEO | Nube Technologies LLP >> http://www.nubetech.co >> http://code.google.com/p/hiho/ >> >> >> >> >> >> >> On Mon, Nov 8, 2010 at 2:11 PM, Tim Robertson >> wrote: >>> >>> Hi all, >>> >>> I am trying my first UDTF, but can't seem to get it to run. Can >>> anyone spot anything wrong with this please: >>> >>> hive> select taxonDensityUDTF(kingdom_concept_id, phylum_concept_id) >>> as p,k from temp_kingdom_phylum; >>> FAILED: Error in semantic analysis: Only a single expression in the >>> SELECT clause is supported with UDTF's >>> hive> >>> >>> Below is my code. 
Thanks for any pointers, >>> >>> Tim >>> >>> >>> >>> @description( >>> name = "taxonDensityUDTF", >>> value = "_FUNC_(kingdom_concept_id, phylum_concept_id)" >>> ) >>> public class TaxonDensityUDTF extends GenericUDTF { >>> Integer kingdom_concept_id = Integer.valueOf(0); >>> Integer phylum_concept_id = Integer.valueOf(0); >>> >>> /** >>> * @see org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#close() >>> */ >>> @Override >>> public void close() throws HiveException { >>> Object[] forwardObj = new Object[2]; >>> forwardObj[0] = kingdom_concept_id; >>> forwardObj[1] = phylum_concept_id; >>> forward(forwardObj); >>> // TEST STUFF FOR NOW >>> forwardObj = new Object[2]; >>> forwardObj[0] = kingdom_concept_id+1; >>> forwardObj[1] = phylum_concept_id+1; >>> forward(forwardObj); >>> } >>> >>> /** >>> * @see >>> org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#initialize(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector[]) >>> */ >>> @Override >>> public StructObjectInspector initialize(ObjectInspector[] arg0) >>> throws UDFArgumentException { >>> ArrayList fieldNames = new ArrayList(); >>> ArrayList fieldOIs = new >>> ArrayList(); >>> fieldNames.add("kingdom_concept_id"); >>> fieldNames.add("phylum_concept_id"); >>> >>> fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector); >>> >>> fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector); >>> return >>> ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs); >>> } >>> >>> /** >>> * @see >>> org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#process(java.lang.Object[]) >>> */ >>> @Override >>> public void process(Object[] args) throws HiveException { >>> kingdom_concept_id = (Integer) args[0]; >>> phylum_concept_id = (Integer) args[1]; >>> } >>> } >> >> >
Re: Unions causing many scans of input - workaround?
For anyone stumbling across this thread the (pretty much) working UDTF can be seen: http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-spatial/src/main/java/org/gbif/occurrence/spatial/udf/TaxonDensityUDTF.java allowing for the following query: create table taxon_density as select taxonId,tileX,tileY,zoom,clusterX,clusterY,sum(count) from occurrence_record lateral view taxonDensityUDTF(kingdom_concept_id, phylum_concept_id, class_concept_id, order_concept_id, family_concept_id, genus_concept_id, species_concept_id,nub_concept_id, latitude, longitude, 10) e as taxonId,tileX,tileY,zoom,clusterX,clusterY,count group by taxonId,tileX,tileY,zoom,clusterX,clusterY; Thanks again for the pointers Sonal and Namit, and also on the other thread, Tim On Mon, Nov 8, 2010 at 9:17 AM, Tim Robertson wrote: > I am writing a GenericUDTF now, but notice on > http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF > > the method docs show: > /** > * Called to notify the UDTF that there are no more rows to process. Note > that > * forward() should not be called in this function. Only clean up code should > * be run. > */ > public abstract void close() throws HiveException; > > but the example does exactly that: > @Override > public void close() throws HiveException { > forwardObj[0] = count; > forward(forwardObj); > forward(forwardObj); > } > > I'll assume the example is correct and continue, but it might be worth > fixing that page. > > Cheers, > Tim > > > > > > On Mon, Nov 8, 2010 at 7:35 AM, Tim Robertson > wrote: >> Thank you both, >> >> A quick glance looks like that is what I am looking for. When I get >> it working, I'll post the solution. >> >> Cheers, >> Tim >> >> On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain wrote: >>> Other option would be to create a wrapper script (not use either UDF or >>> UDTF) >>> That script, in any language, can emit any number of output rows per input >>> row. 
>>> >>> Look at: >>> http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform >>> for details >>> >>> >>> From: Sonal Goyal [sonalgoy...@gmail.com] >>> Sent: Sunday, November 07, 2010 8:40 PM >>> To: user@hive.apache.org >>> Subject: Re: Unions causing many scans of input - workaround? >>> >>> Hey Tim, >>> >>> You have an interesting problem. Have you tried creating a UDTF for your >>> case, so that you can possibly emit more than one record for each row of >>> your input? >>> >>> http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF >>> >>> Thanks and Regards, >>> Sonal >>> >>> Sonal Goyal | Founder and CEO | Nube Technologies LLP >>> http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal >>> >>> >>> >>> >>> >>> On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson >>> wrote: >>>> >>>> Hi all, >>>> >>>> I am porting custom MR code to Hive and have written working UDFs >>>> where I need them. Is there a work around to having to do this in >>>> Hive: >>>> >>>> select * from >>>> ( >>>> select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as >>>> y, 0 as zoom, funct2(lontgitude, 0) as f2_x, funct2(latitude,0) as >>>> f2_y, count (1) as count >>>> from table >>>> group by name_id, x, y, f2_x, f2_y >>>> >>>> UNION ALL >>>> >>>> select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as >>>> y, 1 as zoom, funct2(lontgitude, 1) as f2_x, funct2(latitude,1) as >>>> f2_y, count (1) as count >>>> from table >>>> group by name_id, x, y, f2_x, f2_y >>>> >>>> --- etc etc increasing in zoom >>>> ) >>>> >>>> The issue being that this does many passes over the table, whereas >>>> previously in my Map() I would just emit many times from the same >>>> input record and then let it all group in the shuffle and sort. 
>>>> I actually emit 184 times for an input record (23 zoom levels of >>>> google maps, and 8 ways to derive the name_id) for a single record >>>> which means 184 union statements - Is it possible in hive to force it >>>> to emit many times from the source record in the stage-1 map? >>>> >>>> (ahem) Does anyone know if Pig can do this if not in Hive? >>>> >>>> I hope I have explained this well enough to make sense. >>>> >>>> Thanks in advance, >>>> Tim >>> >>> >> >
Re: Only a single expression in the SELECT clause is supported with UDTF's
Thank you once again Sonal. The following worked just nicely: select k,p from temp_kingdom_phylum lateral view taxonDensityUDTF(kingdom_concept_id, phylum_concept_id) e as k,p; [For anyone stumbling across this thread: I also had to change the close() and issue the forward() in process() and not in close(), which is against the example shipped with Hive but in accord with the docs, which say one must not do this] Cheers, Tim On Mon, Nov 8, 2010 at 2:18 PM, Sonal Goyal wrote: > Hi Tim, > > I guess you are running into limitations while using UDTFs. Check > http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#UDTF. I think you > should be able to use lateral view in your query. > > Thanks and Regards, > Sonal > > Sonal Goyal | Founder and CEO | Nube Technologies LLP > http://www.nubetech.co > http://code.google.com/p/hiho/ > > > > > > > On Mon, Nov 8, 2010 at 2:11 PM, Tim Robertson > wrote: >> >> Hi all, >> >> I am trying my first UDTF, but can't seem to get it to run. Can >> anyone spot anything wrong with this please: >> >> hive> select taxonDensityUDTF(kingdom_concept_id, phylum_concept_id) >> as p,k from temp_kingdom_phylum; >> FAILED: Error in semantic analysis: Only a single expression in the >> SELECT clause is supported with UDTF's >> hive> >> >> Below is my code. 
Thanks for any pointers, >> >> Tim >> >> >> >> @description( >> name = "taxonDensityUDTF", >> value = "_FUNC_(kingdom_concept_id, phylum_concept_id)" >> ) >> public class TaxonDensityUDTF extends GenericUDTF { >> Integer kingdom_concept_id = Integer.valueOf(0); >> Integer phylum_concept_id = Integer.valueOf(0); >> >> /** >> * @see org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#close() >> */ >> @Override >> public void close() throws HiveException { >> Object[] forwardObj = new Object[2]; >> forwardObj[0] = kingdom_concept_id; >> forwardObj[1] = phylum_concept_id; >> forward(forwardObj); >> // TEST STUFF FOR NOW >> forwardObj = new Object[2]; >> forwardObj[0] = kingdom_concept_id+1; >> forwardObj[1] = phylum_concept_id+1; >> forward(forwardObj); >> } >> >> /** >> * @see >> org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#initialize(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector[]) >> */ >> @Override >> public StructObjectInspector initialize(ObjectInspector[] arg0) >> throws UDFArgumentException { >> ArrayList fieldNames = new ArrayList(); >> ArrayList fieldOIs = new >> ArrayList(); >> fieldNames.add("kingdom_concept_id"); >> fieldNames.add("phylum_concept_id"); >> >> fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector); >> >> fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector); >> return >> ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs); >> } >> >> /** >> * @see >> org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#process(java.lang.Object[]) >> */ >> @Override >> public void process(Object[] args) throws HiveException { >> kingdom_concept_id = (Integer) args[0]; >> phylum_concept_id = (Integer) args[1]; >> } >> } > >
Only a single expression in the SELECT clause is supported with UDTF's
Hi all, I am trying my first UDTF, but can't seem to get it to run. Can anyone spot anything wrong with this please: hive> select taxonDensityUDTF(kingdom_concept_id, phylum_concept_id) as p,k from temp_kingdom_phylum; FAILED: Error in semantic analysis: Only a single expression in the SELECT clause is supported with UDTF's hive> Below is my code. Thanks for any pointers, Tim @description( name = "taxonDensityUDTF", value = "_FUNC_(kingdom_concept_id, phylum_concept_id)" ) public class TaxonDensityUDTF extends GenericUDTF { Integer kingdom_concept_id = Integer.valueOf(0); Integer phylum_concept_id = Integer.valueOf(0); /** * @see org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#close() */ @Override public void close() throws HiveException { Object[] forwardObj = new Object[2]; forwardObj[0] = kingdom_concept_id; forwardObj[1] = phylum_concept_id; forward(forwardObj); // TEST STUFF FOR NOW forwardObj = new Object[2]; forwardObj[0] = kingdom_concept_id+1; forwardObj[1] = phylum_concept_id+1; forward(forwardObj); } /** * @see org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#initialize(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector[]) */ @Override public StructObjectInspector initialize(ObjectInspector[] arg0) throws UDFArgumentException { ArrayList fieldNames = new ArrayList(); ArrayList fieldOIs = new ArrayList(); fieldNames.add("kingdom_concept_id"); fieldNames.add("phylum_concept_id"); fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector); fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector); return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs); } /** * @see org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#process(java.lang.Object[]) */ @Override public void process(Object[] args) throws HiveException { kingdom_concept_id = (Integer) args[0]; phylum_concept_id = (Integer) args[1]; } }
Re: Unions causing many scans of input - workaround?
I am writing a GenericUDTF now, but notice on http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF the method docs show: /** * Called to notify the UDTF that there are no more rows to process. Note that * forward() should not be called in this function. Only clean up code should * be run. */ public abstract void close() throws HiveException; but the example does exactly that: @Override public void close() throws HiveException { forwardObj[0] = count; forward(forwardObj); forward(forwardObj); } I'll assume the example is correct and continue, but it might be worth fixing that page. Cheers, Tim On Mon, Nov 8, 2010 at 7:35 AM, Tim Robertson wrote: > Thank you both, > > A quick glance looks like that is what I am looking for. When I get > it working, I'll post the solution. > > Cheers, > Tim > > On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain wrote: >> Other option would be to create a wrapper script (not use either UDF or >> UDTF) >> That script, in any language, can emit any number of output rows per input >> row. >> >> Look at: >> http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform >> for details >> >> >> From: Sonal Goyal [sonalgoy...@gmail.com] >> Sent: Sunday, November 07, 2010 8:40 PM >> To: user@hive.apache.org >> Subject: Re: Unions causing many scans of input - workaround? >> >> Hey Tim, >> >> You have an interesting problem. Have you tried creating a UDTF for your >> case, so that you can possibly emit more than one record for each row of >> your input? >> >> http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF >> >> Thanks and Regards, >> Sonal >> >> Sonal Goyal | Founder and CEO | Nube Technologies LLP >> http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal >> >> >> >> >> >> On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson >> wrote: >>> >>> Hi all, >>> >>> I am porting custom MR code to Hive and have written working UDFs >>> where I need them. 
Is there a work around to having to do this in >>> Hive: >>> >>> select * from >>> ( >>> select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as >>> y, 0 as zoom, funct2(lontgitude, 0) as f2_x, funct2(latitude,0) as >>> f2_y, count (1) as count >>> from table >>> group by name_id, x, y, f2_x, f2_y >>> >>> UNION ALL >>> >>> select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as >>> y, 1 as zoom, funct2(lontgitude, 1) as f2_x, funct2(latitude,1) as >>> f2_y, count (1) as count >>> from table >>> group by name_id, x, y, f2_x, f2_y >>> >>> --- etc etc increasing in zoom >>> ) >>> >>> The issue being that this does many passes over the table, whereas >>> previously in my Map() I would just emit many times from the same >>> input record and then let it all group in the shuffle and sort. >>> I actually emit 184 times for an input record (23 zoom levels of >>> google maps, and 8 ways to derive the name_id) for a single record >>> which means 184 union statements - Is it possible in hive to force it >>> to emit many times from the source record in the stage-1 map? >>> >>> (ahem) Does anyone know if Pig can do this if not in Hive? >>> >>> I hope I have explained this well enough to make sense. >>> >>> Thanks in advance, >>> Tim >> >> >
Re: Unions causing many scans of input - workaround?
Thank you both, A quick glance looks like that is what I am looking for. When I get it working, I'll post the solution. Cheers, Tim On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain wrote: > Other option would be to create a wrapper script (not use either UDF or > UDTF) > That script, in any language, can emit any number of output rows per input > row. > > Look at: > http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform > for details > > > From: Sonal Goyal [sonalgoy...@gmail.com] > Sent: Sunday, November 07, 2010 8:40 PM > To: user@hive.apache.org > Subject: Re: Unions causing many scans of input - workaround? > > Hey Tim, > > You have an interesting problem. Have you tried creating a UDTF for your > case, so that you can possibly emit more than one record for each row of > your input? > > http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF > > Thanks and Regards, > Sonal > > Sonal Goyal | Founder and CEO | Nube Technologies LLP > http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal > > > > > > On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson > wrote: >> >> Hi all, >> >> I am porting custom MR code to Hive and have written working UDFs >> where I need them. Is there a work around to having to do this in >> Hive: >> >> select * from >> ( >> select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as >> y, 0 as zoom, funct2(lontgitude, 0) as f2_x, funct2(latitude,0) as >> f2_y, count (1) as count >> from table >> group by name_id, x, y, f2_x, f2_y >> >> UNION ALL >> >> select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as >> y, 1 as zoom, funct2(lontgitude, 1) as f2_x, funct2(latitude,1) as >> f2_y, count (1) as count >> from table >> group by name_id, x, y, f2_x, f2_y >> >> --- etc etc increasing in zoom >> ) >> >> The issue being that this does many passes over the table, whereas >> previously in my Map() I would just emit many times from the same >> input record and then let it all group in the shuffle and sort. 
>> I actually emit 184 times for an input record (23 zoom levels of >> google maps, and 8 ways to derive the name_id) for a single record >> which means 184 union statements - Is it possible in hive to force it >> to emit many times from the source record in the stage-1 map? >> >> (ahem) Does anyone know if Pig can do this if not in Hive? >> >> I hope I have explained this well enough to make sense. >> >> Thanks in advance, >> Tim > >
Unions causing many scans of input - workaround?
Hi all, I am porting custom MR code to Hive and have written working UDFs where I need them. Is there a workaround to having to do this in Hive: select * from ( select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as y, 0 as zoom, funct2(longitude, 0) as f2_x, funct2(latitude,0) as f2_y, count(1) as count from table group by name_id, x, y, f2_x, f2_y UNION ALL select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as y, 1 as zoom, funct2(longitude, 1) as f2_x, funct2(latitude,1) as f2_y, count(1) as count from table group by name_id, x, y, f2_x, f2_y --- etc etc increasing in zoom ) The issue being that this does many passes over the table, whereas previously in my Map() I would just emit many times from the same input record and then let it all group in the shuffle and sort. I actually emit 184 times for a single input record (23 zoom levels of Google Maps, and 8 ways to derive the name_id), which means 184 union statements - is it possible in Hive to force it to emit many times from the source record in the stage-1 map? (ahem) Does anyone know if Pig can do this, if not in Hive? I hope I have explained this well enough to make sense. Thanks in advance, Tim
Re: NOT IN query
Please try this in Hive: select distinct a.id from tableA a LEFT OUTER join tableB b on a.id=b.id where b.id is null Cheers, Tim On Wed, Nov 3, 2010 at 1:19 PM, Tim Robertson wrote: > In SQL you use a left join: > > # so in mysql: > select distinct a.id from tableA a left join tableB b on a.id=b.id > where b.id is null > > Not sure exactly how that ports to Hive, but it should be something > along those lines. > > HTH, > Tim > > > On Wed, Nov 3, 2010 at 1:13 PM, איל (Eyal) wrote: >> Hi, >> >> I have a table A with some values and another table B with some other values >> >> How do I get all the distinct values from A that are NOT in B >> >> e.g >> >> if table A has values 1,2,3,4,1,2,3,5,6,7 and B has values 2,3,4,5,6 >> >> then the result should be 1,7 >> >> Thanks >> Eyal >> >
Re: NOT IN query
In SQL you use a left join: # so in mysql: select distinct a.id from tableA a left join tableB b on a.id=b.id where b.id is null Not sure exactly how that ports to Hive, but it should be something along those lines. HTH, Tim On Wed, Nov 3, 2010 at 1:13 PM, איל (Eyal) wrote: > Hi, > > I have a table A with some values and another table B with some other values > > How do I get all the distinct values from A that are NOT in B > > e.g > > if table A has values 1,2,3,4,1,2,3,5,6,7 and B has values 2,3,4,5,6 > > then the result should be 1,7 > > Thanks > Eyal >
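[Editorial note: the left-outer-join trick above is just a set difference. A minimal, self-contained Java sketch of the semantics, using the example data from this thread (class and method names are made up for illustration):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class AntiJoinDemo {
    // Distinct values of A with no match in B -- the semantics of
    // "a LEFT OUTER JOIN b ON a.id = b.id WHERE b.id IS NULL".
    static Set<Integer> notIn(List<Integer> a, List<Integer> b) {
        Set<Integer> result = new LinkedHashSet<>(a); // distinct, keeps first-seen order
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> tableA = List.of(1, 2, 3, 4, 1, 2, 3, 5, 6, 7);
        List<Integer> tableB = List.of(2, 3, 4, 5, 6);
        System.out.println(notIn(tableA, tableB)); // [1, 7]
    }
}
```

The outer join keeps every row of A; rows with a match carry a non-null b.id, so the IS NULL filter leaves exactly the non-matching ones.]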
Re: UDF help
Thanks Edward. I'll poke around there. On Tue, Nov 2, 2010 at 6:40 PM, Edward Capriolo wrote: > On Tue, Nov 2, 2010 at 12:47 PM, Tim Robertson > wrote: >> Hi all, >> >> Is the following a valid UDF please? >> >> When I run it I get the following so I presume not: >> hive> select toGoogleCoords(latitude,longitude,1) from >> raw_occurrence_record limit 100; >> FAILED: Error in semantic analysis: >> java.lang.IllegalArgumentException: Error: name expected at the >> position 7 of 'struct<>' but '>' is found. >> >> Is it possible to return an Array from a UDF? >> >> Thanks for any pointers, >> Tim >> >> >> >> >> public class GoogleTileCoordsUDF extends UDF { >> >> public IntWritable[] evaluate(Text latitude, Text longitude, >> IntWritable zoomLevel) { >> if (latitude == null >> || longitude == null >> || zoomLevel == null) { >> return null; >> } >> >> double lat = Double.parseDouble(latitude.toString()); >> double lng = Double.parseDouble(longitude.toString()); >> >> Point p = GoogleTileUtil.toTileXY(lat, lng, zoomLevel.get()); >> >> if (p==null) { >> return null; >> } >> >> IntWritable[] xy = new IntWritable[2]; >> xy[0]=new IntWritable(p.x); >> xy[1]=new IntWritable(p.y); >> return xy; >> } >> } >> > > I believe a UDF can not return an Array type. You may have to use the > GenericUDF interface for that. >
UDF help
Hi all, Is the following a valid UDF please? When I run it I get the following so I presume not: hive> select toGoogleCoords(latitude,longitude,1) from raw_occurrence_record limit 100; FAILED: Error in semantic analysis: java.lang.IllegalArgumentException: Error: name expected at the position 7 of 'struct<>' but '>' is found. Is it possible to return an Array from a UDF? Thanks for any pointers, Tim public class GoogleTileCoordsUDF extends UDF { public IntWritable[] evaluate(Text latitude, Text longitude, IntWritable zoomLevel) { if (latitude == null || longitude == null || zoomLevel == null) { return null; } double lat = Double.parseDouble(latitude.toString()); double lng = Double.parseDouble(longitude.toString()); Point p = GoogleTileUtil.toTileXY(lat, lng, zoomLevel.get()); if (p==null) { return null; } IntWritable[] xy = new IntWritable[2]; xy[0]=new IntWritable(p.x); xy[1]=new IntWritable(p.y); return xy; } }
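[Editorial note: GoogleTileUtil is the poster's own class, so its code is not in the thread. For context, a self-contained sketch of what a toTileXY-style helper presumably computes, assuming the standard Web Mercator (Google/slippy-map) tiling formula rather than the actual implementation:

```java
import java.awt.Point;

public class TileMath {
    // Standard Web Mercator tile coordinates at a given zoom level.
    // Returns null outside the projection's valid range, mirroring the
    // null-check in the UDF above.
    static Point toTileXY(double lat, double lng, int zoom) {
        if (lat < -85.05112878 || lat > 85.05112878 || lng < -180 || lng > 180) {
            return null;
        }
        int n = 1 << zoom; // number of tiles along each axis
        int x = (int) Math.floor((lng + 180.0) / 360.0 * n);
        double latRad = Math.toRadians(lat);
        int y = (int) Math.floor(
            (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0 * n);
        // Clamp the boundary values (lng == 180, lat at the projection edge)
        // onto the last tile instead of one past it.
        x = Math.min(x, n - 1);
        y = Math.min(y, n - 1);
        return new Point(x, y);
    }

    public static void main(String[] args) {
        // The equator/prime-meridian point lands in tile (1,1) at zoom 1.
        System.out.println(toTileXY(0.0, 0.0, 1));
    }
}
```
]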
Re: HBase as input AND output?
That's right. Hive can use an HBase table as an input format to the hive query regardless of output format, and can also write the output to an HBase table regardless of the input format. You can also supposedly do a join in Hive that uses 1 side of the join from an HBase table, and the other side a text file, which is very powerful. I haven't done it myself, but intend to shortly. HTH, Tim On Wed, Oct 13, 2010 at 10:07 PM, Otis Gospodnetic wrote: > Hi, > > I was wondering how I can query data stored in HBase and remembered Hive's > HBase > integration: > http://wiki.apache.org/hadoop/Hive/HBaseIntegration > > After watching John Sichi's video > (http://developer.yahoo.com/blogs/hadoop/posts/2010/04/hundreds_of_hadoop_fans_at_the/ > ) I have a better idea about what functionality this integration provides, > but > I still have some questions. > > Would it be correct to say that Hive-HBase integration makes the following > data > flow possible: > > 0) Hive or Files => Custom HQL statement that aggregates data ==> HBase > 1) HBase ==> Custom HQL statement that aggregates data ==> HBase > 2) HBase ==> Custom HQL statement that aggregates data ==> output (console?) > > Of the above, 1) is what I'm wondering the most about right now. > > In other words, it seems to me that Hive may be able to look at *just* data > stored in HBase *without* the typical data/files in HDFS that Hive normally > runs > its MR jobs against. > > Is this correct? > > Thanks, > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Hadoop ecosystem search :: http://search-hadoop.com/ > >
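[Editorial note: a minimal HiveQL sketch of flow 1) above -- HBase in, HBase out. All table, column-family, and column names here are hypothetical; it assumes the hive-hbase-handler jar is on Hive's classpath and that both HBase tables already exist, per the HBaseIntegration wiki page linked in the thread:

```sql
-- Map an existing HBase table into Hive (hypothetical names throughout).
CREATE EXTERNAL TABLE hbase_counts(key STRING, total BIGINT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f:total")
TBLPROPERTIES ("hbase.table.name" = "counts");

-- Aggregate one HBase-backed table into another; hbase_source is assumed
-- to be a second HBase-backed table declared the same way.
INSERT OVERWRITE TABLE hbase_counts
SELECT key, count(1) FROM hbase_source GROUP BY key;
```
]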