Re: [VOTE] Bylaws change to allow some commits without review

2016-04-19 Thread Tim Robertson
+1

On Wed, Apr 20, 2016 at 1:24 AM, Jimmy Xiang  wrote:

> +1
>
> On Tue, Apr 19, 2016 at 2:58 PM, Alpesh Patel 
> wrote:
> > +1
> >
> > On Tue, Apr 19, 2016 at 1:29 PM, Lars Francke 
> > wrote:
> >>
> >> Thanks everyone! Vote runs for at least one more day. I'd appreciate it
> if
> >> you could ping/bump your colleagues to chime in here.
> >>
> >> I'm not entirely sure how many PMC members are active and how many votes
> >> we need but I think a few more are probably needed.
> >>
> >> On Mon, Apr 18, 2016 at 8:02 PM, Thejas Nair 
> >> wrote:
> >>>
> >>> +1
> >>>
> >>> 
> >>> From: Wei Zheng 
> >>> Sent: Monday, April 18, 2016 10:51 AM
> >>> To: user@hive.apache.org
> >>> Subject: Re: [VOTE] Bylaws change to allow some commits without review
> >>>
> >>> +1
> >>>
> >>> Thanks,
> >>> Wei
> >>>
> >>> From: Siddharth Seth 
> >>> Reply-To: "user@hive.apache.org" 
> >>> Date: Monday, April 18, 2016 at 10:29
> >>> To: "user@hive.apache.org" 
> >>> Subject: Re: [VOTE] Bylaws change to allow some commits without review
> >>>
> >>> +1
> >>>
> >>> On Wed, Apr 13, 2016 at 3:58 PM, Lars Francke 
> >>> wrote:
> 
>  Hi everyone,
> 
>  we had a discussion on the dev@ list about allowing some forms of
>  contributions to be committed without a review.
> 
>  The exact sentence I propose to add is: "Minor issues (e.g. typos,
> code
>  style issues, JavaDoc changes. At committer's discretion) can be
> committed
>  after soliciting feedback/review on the mailing list and not receiving
>  feedback within 2 days."
> 
>  The proposed bylaws can also be seen here
>  <
> https://cwiki.apache.org/confluence/display/Hive/Proposed+Changes+to+Hive+Project+Bylaws+-+April+2016
> >
> 
>  This vote requires a 2/3 majority of all Active PMC members so I'd
> love
>  to get as many votes as possible. The vote will run for at least six
> days.
> 
>  Thanks,
>  Lars
> >>>
> >>>
> >>
> >
>


Writing HFiles using Hive for an HBase bulk load - possible bug?

2016-04-19 Thread Tim Robertson
Hi folks,

I am trying to create HFiles from a Hive table to bulk load into HBase and
am following the HWX [1] tutorial.

It creates the HFiles correctly but then fails when closing the
RecordWriter with the following stack trace.

Error: java.lang.RuntimeException: Hive Runtime Error while closing
operators: java.io.IOException: Multiple family directories found in hdfs://
c1n1.gbif.org:8020/user/hive/warehouse/tim.db/coords_hbase/_temporary/2/_temporary
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:295)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

The reason is that the HFiles are created in the task attempt folder, but
the close logic looks for them two directories above the task attempt folder.

Does anyone else see this behaviour please?

I logged it as a bug [2], which also details my exact procedure, but I
wonder if someone could confirm that they also see this, or whether perhaps
I am just doing something wrong and it works for them?

Thanks all,
Tim



[1]
https://community.hortonworks.com/articles/2745/creating-hbase-hfiles-from-an-existing-hive-table.html

[2] https://issues.apache.org/jira/browse/HIVE-13539


Re: UDF error?

2013-09-30 Thread Tim Robertson
Here is an example of a no-arg UDF that will return a different value for
each row:
https://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/java/org/gbif/occurrencestore/hive/udf/UuidUDF.java

Hope this helps,
Tim


On Mon, Sep 30, 2013 at 10:59 PM, Yang  wrote:

> thanks!  at first I did have a no-arg evaluate(), but somehow
>
> select myfunction(), field1, field2 from mytable ;
>
> spits out the same value for myfunction() for each row. So I was wondering
> whether the UDF got called only once, because the Hive compiler sees that
> the argument is void, so that all the invocations would be "having the
> same value". So I tried to pass in a param to prevent this possibility.
>
>
> On Mon, Sep 30, 2013 at 1:55 PM, Tim Robertson 
> wrote:
>
>> It's been ages since I wrote one, but the differences to mine:
>>
>> a) I use LongWritable:  public LongWritable evaluate(LongWritable
>> startAt) {
>> b) I have annotations on the class (but I think they are just for docs)
>> @Description(name = "row_sequence",
>>   value = "_FUNC_() - Returns a generated row sequence number starting
>> from 1")
>> @UDFType(deterministic = false)
>> public class UDFRowSequence extends UDF {
>>
>> Hope this helps!
>> Tim
>>
>>
>>
>> On Mon, Sep 30, 2013 at 10:47 PM, Yang  wrote:
>>
>>> I wrote a super simple UDF, but got some errors:
>>>
>>> UDF:
>>>
>>> package yy;
>>> import org.apache.hadoop.hive.ql.exec.UDF;
>>> import java.util.Random;
>>> import java.util.UUID;
>>> import java.lang.management.*;
>>>
>>> public class MyUdf extends UDF {
>>> static Random rand = new Random(System.currentTimeMillis() +
>>> Thread.currentThread().getId()* 100);
>>> String name = ManagementFactory.getRuntimeMXBean().getName();
>>> long startValue = Long.valueOf(name.replaceAll("[^\\d]+", "")) *
>>> 1 + Thread.currentThread().getId() * 1000;
>>> public long evaluate(long x ) {
>>> //return (long)UUID.randomUUID().hashCode();
>>> //return rand.nextLong();
>>> return startValue++;
>>>  }
>>> }
>>>
>>>
>>>
>>>
>>>
>>> sql script:
>>>
>>> CREATE TEMPORARY FUNCTION gen_uniq2 AS 'yy.MyUdf';
>>> select gen_uniq2(field1), field2
>>> from yy_mapping limit 10;
>>>
>>> field1 is bigint, field2 is int
>>>
>>>
>>>
>>>
>>>
>>> error:
>>>
>>> hive> source aa.sql;
>>> Added ./MyUdf.jar to class path
>>> Added resource: ./MyUdf.jar
>>> OK
>>> Time taken: 0.0070 seconds
>>> FAILED: SemanticException [Error 10014]: Line 2:7 Wrong arguments
>>> 'field1': No matching method for class yy.MyUdf with (bigint). Possible
>>> choices: _FUNC_()
>>>
>>>
>>>
>>>
>>>
>>> so I'm declaring a UDF with arg of long, so that should work for a
>>> bigint (more importantly it's complaining not long vs bigint, but bigint vs
>>> void ). I tried changing both to int, same failure
>>>
>>>
>>> thanks!
>>> yang
>>>
>>>
>>>
>>
>


Re: UDF error?

2013-09-30 Thread Tim Robertson
That class is:
https://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/java/org/gbif/occurrencestore/hive/udf/UDFRowSequence.java

Cheers,
Tim


On Mon, Sep 30, 2013 at 10:55 PM, Tim Robertson
wrote:

> It's been ages since I wrote one, but the differences to mine:
>
> a) I use LongWritable:  public LongWritable evaluate(LongWritable startAt)
> {
> b) I have annotations on the class (but I think they are just for docs)
> @Description(name = "row_sequence",
>   value = "_FUNC_() - Returns a generated row sequence number starting
> from 1")
> @UDFType(deterministic = false)
> public class UDFRowSequence extends UDF {
>
> Hope this helps!
> Tim
>
>
>
> On Mon, Sep 30, 2013 at 10:47 PM, Yang  wrote:
>
>> I wrote a super simple UDF, but got some errors:
>>
>> UDF:
>>
>> package yy;
>> import org.apache.hadoop.hive.ql.exec.UDF;
>> import java.util.Random;
>> import java.util.UUID;
>> import java.lang.management.*;
>>
>> public class MyUdf extends UDF {
>> static Random rand = new Random(System.currentTimeMillis() +
>> Thread.currentThread().getId()* 100);
>> String name = ManagementFactory.getRuntimeMXBean().getName();
>> long startValue = Long.valueOf(name.replaceAll("[^\\d]+", "")) *
>> 1 + Thread.currentThread().getId() * 1000;
>> public long evaluate(long x ) {
>> //return (long)UUID.randomUUID().hashCode();
>> //return rand.nextLong();
>> return startValue++;
>>  }
>> }
>>
>>
>>
>>
>>
>> sql script:
>>
>> CREATE TEMPORARY FUNCTION gen_uniq2 AS 'yy.MyUdf';
>> select gen_uniq2(field1), field2
>> from yy_mapping limit 10;
>>
>> field1 is bigint, field2 is int
>>
>>
>>
>>
>>
>> error:
>>
>> hive> source aa.sql;
>> Added ./MyUdf.jar to class path
>> Added resource: ./MyUdf.jar
>> OK
>> Time taken: 0.0070 seconds
>> FAILED: SemanticException [Error 10014]: Line 2:7 Wrong arguments
>> 'field1': No matching method for class yy.MyUdf with (bigint). Possible
>> choices: _FUNC_()
>>
>>
>>
>>
>>
>> so I'm declaring a UDF with arg of long, so that should work for a bigint
>> (more importantly it's complaining not long vs bigint, but bigint vs void
>> ). I tried changing both to int, same failure
>>
>>
>> thanks!
>> yang
>>
>>
>>
>


Re: UDF error?

2013-09-30 Thread Tim Robertson
It's been ages since I wrote one, but the differences to mine:

a) I use LongWritable:  public LongWritable evaluate(LongWritable startAt) {
b) I have annotations on the class (but I think they are just for docs)
@Description(name = "row_sequence",
  value = "_FUNC_() - Returns a generated row sequence number starting from
1")
@UDFType(deterministic = false)
public class UDFRowSequence extends UDF {

Hope this helps!
Tim



On Mon, Sep 30, 2013 at 10:47 PM, Yang  wrote:

> I wrote a super simple UDF, but got some errors:
>
> UDF:
>
> package yy;
> import org.apache.hadoop.hive.ql.exec.UDF;
> import java.util.Random;
> import java.util.UUID;
> import java.lang.management.*;
>
> public class MyUdf extends UDF {
> static Random rand = new Random(System.currentTimeMillis() +
> Thread.currentThread().getId()* 100);
> String name = ManagementFactory.getRuntimeMXBean().getName();
> long startValue = Long.valueOf(name.replaceAll("[^\\d]+", "")) *
> 1 + Thread.currentThread().getId() * 1000;
> public long evaluate(long x ) {
> //return (long)UUID.randomUUID().hashCode();
> //return rand.nextLong();
> return startValue++;
>  }
> }
>
>
>
>
>
> sql script:
>
> CREATE TEMPORARY FUNCTION gen_uniq2 AS 'yy.MyUdf';
> select gen_uniq2(field1), field2
> from yy_mapping limit 10;
>
> field1 is bigint, field2 is int
>
>
>
>
>
> error:
>
> hive> source aa.sql;
> Added ./MyUdf.jar to class path
> Added resource: ./MyUdf.jar
> OK
> Time taken: 0.0070 seconds
> FAILED: SemanticException [Error 10014]: Line 2:7 Wrong arguments
> 'field1': No matching method for class yy.MyUdf with (bigint). Possible
> choices: _FUNC_()
>
>
>
>
>
> so I'm declaring a UDF with arg of long, so that should work for a bigint
> (more importantly it's complaining not long vs bigint, but bigint vs void
> ). I tried changing both to int, same failure
>
>
> thanks!
> yang
>
>
>


Re: Questions about Hive

2012-09-17 Thread Tim Robertson
I don't think Hive is intended for web request scoped operations... that
would be a rather unusual case from my understanding.

HBase sounds more like the Hadoop equivalent that you might be looking for,
but you need to look at your search patterns to see if HBase is a good fit
(you need to manage your own indexes again).

Cheers,
Tim


On Mon, Sep 17, 2012 at 8:07 AM, Something Something <
mailinglist...@gmail.com> wrote:

> Thank you both for the answers.  We are trying to find out if Hive can be
> used as a replacement of Netezza, but if there are no indexes then I don't
> see how it will beat Netezza in terms of performance.  Sounds like it
> certainly can't be used to do a quick lookup from a webapp - like Netezza
> can.
>
> If performance isn't a concern, then I guess it could be a useful tool.
> Will try it out & see how it works out.  Thanks.
>
>
>
> On Sun, Sep 16, 2012 at 10:51 PM, Tim Robertson  > wrote:
>
>> Note:  I am a newbie to Hive.
>>>
>>> Can someone please answer the following questions?
>>>
>>> 1)  Does Hive provide APIs (like HBase does) that can be used to
>>> retrieve data from the tables in Hive from a Java program?  I heard
>>> somewhere that the data can be accessed with JDBC (style) APIs.  True?
>>>
>>
>> True.
>> https://cwiki.apache.org/Hive/hiveclient.html#HiveClient-JDBC
>>
>>
>>> 2)  I don't see how I can add indexes on the tables, so does that mean a
>>> query such as the following will trigger a MR job that will search files on
>>> HDFS sequentially?
>>>
>>> hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
>>>
>>>
> >> There are some index implementations in Hive, but it is not as simple as
> >> in a traditional db.
>> E.g. Search Jira and see some of the work:
>> https://issues.apache.org/jira/browse/HIVE-417
>>
>> You are correct that the above would do a full table scan
>>
>> 3)  Has anyone compared performance of Hive against other NOSQL databases
>>> such as HBase, MongoDB.  I understand it's not exactly apples to apples
>>> comparison, but still...
>>>
>>
> >> I think you misunderstand what Hive is.  It is basically a SQL-to-MapReduce
> >> translation engine, which has adapters for the input source.  By default it
> >> uses simple files on HDFS, but there are (e.g.) HBase adapters, so you
> >> can use it to run SQL on HBase tables for example (which works great).
> >> Regarding performance, a Hive scan over HBase is the same operation as
> >> running a normal HBase MR scan, so the performance is the same.
>>
>>
>>>
>>> Thanks.
>>
>>
>>
>


Re: Questions about Hive

2012-09-16 Thread Tim Robertson
>
> Note:  I am a newbie to Hive.
>
> Can someone please answer the following questions?
>
> 1)  Does Hive provide APIs (like HBase does) that can be used to retrieve
> data from the tables in Hive from a Java program?  I heard somewhere that
> the data can be accessed with JDBC (style) APIs.  True?
>

True.
https://cwiki.apache.org/Hive/hiveclient.html#HiveClient-JDBC
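
For illustration, a minimal sketch of the client side (based on the example
on that wiki page; assumes a Hive server listening on localhost:10000):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
      // register the Hive JDBC driver
      Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
      // connect to a running Hive server
      Connection con = DriverManager.getConnection(
          "jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();
      ResultSet res = stmt.executeQuery(
          "SELECT a.foo FROM invites a WHERE a.ds='2008-08-15'");
      while (res.next()) {
        System.out.println(res.getString(1));
      }
      con.close();
    }
  }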


> 2)  I don't see how I can add indexes on the tables, so does that mean a
> query such as the following will trigger a MR job that will search files on
> HDFS sequentially?
>
> hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
>
>
There are some index implementations in Hive, but it is not as simple as in
a traditional db.
E.g. Search Jira and see some of the work:
https://issues.apache.org/jira/browse/HIVE-417

You are correct that the above would do a full table scan

3)  Has anyone compared performance of Hive against other NOSQL databases
> such as HBase, MongoDB.  I understand it's not exactly apples to apples
> comparison, but still...
>

I think you misunderstand what Hive is.  It is basically a SQL-to-MapReduce
translation engine, which has adapters for the input source.  By default it
uses simple files on HDFS, but there are (e.g.) HBase adapters, so you
can use it to run SQL on HBase tables for example (which works great).
Regarding performance, a Hive scan over HBase is the same operation as
running a normal HBase MR scan, so the performance is the same.


>
> Thanks.


Data-local map tasks - reported correctly in JT?

2012-05-02 Thread Tim Robertson
Hi all,

I have a 6-node cluster, and on a simple query over a table created from a
CSV, I was seeing a lot of mappers reporting that they were not using data
locality.
I changed the replication factor to 6 but still MR is showing only about
60% data locality in the data-local map tasks.

How can this be when I have no under-replicated blocks, and the replication
count is the same as the machine count?  Am I missing something?  Does it
indicate that something is wrong in the MR configuration (e.g. a TT not
recognizing the local DN)?

The 6 machines each have 12 spindles in them and I'm running Hive 0.7 and
0.9 trunk built about 2 weeks ago.

Many thanks!
Tim


Re: Hive can't find some tables

2012-04-18 Thread Tim Robertson
It sounds like you have run Sqoop without specifying a durable metastore
for Hive.  E.g. you haven't told Hive to use MySQL, PostgreSQL etc. to store
its metadata.  It probably used Derby, which either kept it all in memory
or put it all in the /tmp directory, which was destroyed on restart.

I would imagine you need to Sqoop the data in again after you correct this.
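
For reference, pointing Hive at e.g. MySQL means putting something like the
following in hive-site.xml (standard metastore property names; the host,
database, user and password are placeholders to adjust):

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hivemeta?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>secret</value>
  </property>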

HTH,
Tim



On Wed, Apr 18, 2012 at 5:29 AM, Đỗ Hoàng Khiêm wrote:

> HI, I have some problems with Hive, looks like Hive cannot read some of my
> tables which was imported before by Sqoop. After importing from Sqoop it's
> still able to read/query from these tables, but after I restarted machine,
> Hive cannot find these tables anymore (as *show tables* didn't list these
> tables) while it's still able to create new tables. I have checked in HDFS,
> these table files still existed at /user/hive/warehouse/.
>
> When start hive shell then try to show tables, the hive log printed out:
>
> 2012-04-18 10:25:11,412 WARN  conf.HiveConf (HiveConf.java:(68)) -
> hive-site.xml not found on CLASSPATH
> 2012-04-18 10:25:43,239 ERROR DataNucleus.Plugin
> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
> "org.eclipse.core.resources" but it cannot be resolved.
> 2012-04-18 10:25:43,239 ERROR DataNucleus.Plugin
> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
> "org.eclipse.core.resources" but it cannot be resolved.
> 2012-04-18 10:25:43,240 ERROR DataNucleus.Plugin
> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
> "org.eclipse.core.runtime" but it cannot be resolved.
> 2012-04-18 10:25:43,240 ERROR DataNucleus.Plugin
> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
> "org.eclipse.core.runtime" but it cannot be resolved.
> 2012-04-18 10:25:43,240 ERROR DataNucleus.Plugin
> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
> "org.eclipse.text" but it cannot be resolved.
> 2012-04-18 10:25:43,240 ERROR DataNucleus.Plugin
> (Log4JLogger.java:error(115)) - Bundle "org.eclipse.jdt.core" requires
> "org.eclipse.text" but it cannot be resolved.
>
> How can I recover these table?
>
> Thanks.
>
>
>


Re: Can we define external table Fields enclosed in "

2012-04-18 Thread Tim Robertson
Hi again,

How about defining a table (t1) with ~ as the delimiter and then creating a
view over that table which uses the regexp_replace UDF?

CREATE VIEW v_user_access AS
SELECT regexp_replace(ip, "\"", "") as ip, ...
FROM t1;

Not sure of the implications for joining, but basic queries should work OK
I would think.

HTH,
Tim






On Wed, Apr 18, 2012 at 9:20 PM, Gopi Kodumur  wrote:

> Thanks Tim, Sorry for not explaining the problem clearly...
>
> I have data in this format ,  I wanted to store the data in Ext-Hive table
> without the Double Quote
>
>
> "127.0.0.17"~"444c1c9a-8820-11e1-aaa8-00219b8a879e"~"2012-04-17T00:00:01Z"~"476825ea-8820-11e1-a105-0200ac1d1c3d
>
> "127.0.0.12"~"544c1c9a-8820-11e1-aaa8-00219b8a879e"~"2012-04-17T00:00:01Z"~"476825ea-8820-11e1-a105-0200ac1d1c3d
>
> "127.0.0.13"~"644c1c9a-8820-11e1-aaa8-00219b8a879e"~"2012-04-17T00:00:01Z"~"476825ea-8820-11e1-a105-0200ac1d1c3d
>
>
>   --
> *From:* Tim Robertson 
> *To:* user@hive.apache.org; Gopi Kodumur 
> *Sent:* Wednesday, April 18, 2012 12:14 PM
> *Subject:* Re: Can we define external table Fields enclosed in "
>
> I believe so.  From the tutorial [1] :
>
>CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
> page_url STRING, referrer_url STRING,
> ip STRING COMMENT 'IP Address of the User',
> country STRING COMMENT 'country of origination')
> COMMENT 'This is the staging page view table'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
>
>
> HTH,
> Tim
>
> [1] https://cwiki.apache.org/confluence/display/Hive/Tutorial
>
>
>
> On Tue, Apr 17, 2012 at 11:20 PM, Gopi Kodumur  wrote:
>
> Is it possible to specify enclosed by character " for fields , while
> defining external table
>
> Thanks
> Gopi
>
>
>
>
>


Re: Can we define external table Fields enclosed in "

2012-04-18 Thread Tim Robertson
I believe so.  From the tutorial [1] :

   CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'


HTH,
Tim

[1] https://cwiki.apache.org/confluence/display/Hive/Tutorial



On Tue, Apr 17, 2012 at 11:20 PM, Gopi Kodumur  wrote:

> Is it possible to specify enclosed by character " for fields , while
> defining external table
>
> Thanks
> Gopi
>
>


Re: Help with INTEGER on HBase (Hive-1634)

2012-04-17 Thread Tim Robertson
Apologies, it does indeed work once you add the correct JARs in Hive (for
us that meant the hive-hbase-handler, hbase and zookeeper jars, e.g. via
--auxpath or ADD JAR).

Tim



On Tue, Apr 17, 2012 at 3:33 PM, Tim Robertson wrote:

> Hi all,
>
> I am *really* interested in Hive-1634 (
> https://issues.apache.org/jira/browse/HIVE-1634).  I have just built from
> Hive trunk using HBase 0.90.4 as the version (e.g. we run cdh3u2).
>
> We have an HBase table populated with Bytes, so I create the Hive table
> like so:
>
> CREATE EXTERNAL TABLE tim_hbase_occurrence (
>   id int,
>   scientific_name string,
>   data_resource_id int
> ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
> SERDEPROPERTIES (
>   "hbase.columns.mapping" = ":key,v:scientific_name,v:data_resource_id",
>   "hbase.columns.storage.types" = "b,b,b"
> ) TBLPROPERTIES(
>   "hbase.table.name" = "mini_occurrences",
>   "hbase.table.default.storage.type" = "binary"
> );
>
> This suggests it understands the formats:
>
> hive> SELECT * FROM tim_hbase_occurrence LIMIT 3;
> OK
> 1444 Abies alba 1081
> 1445 Abies alba 1081
> 1446 Abies alba 1081
>
> But doing any queries with a WHERE clause suggests not:
>
> hive> SELECT * FROM tim_hbase_occurrence WHERE scientific_name='Abies
> alba' limit 3;
> ...
> NULL Abies alba NULL
> NULL Abies alba NULL
> NULL Abies alba NULL
> Time taken: 9.668 seconds
>
> hive> SELECT * FROM tim_hbase_occurrence WHERE data_resource_id=1081;
> ...
> 0 (no records)
>
> Can anyone provide any guidance on this please?
>
> Thanks!
> Tim
>
>


Help with INTEGER on HBase (Hive-1634)

2012-04-17 Thread Tim Robertson
Hi all,

I am *really* interested in Hive-1634 (
https://issues.apache.org/jira/browse/HIVE-1634).  I have just built from
Hive trunk using HBase 0.90.4 as the version (e.g. we run cdh3u2).

We have an HBase table populated with Bytes, so I create the Hive table
like so:

CREATE EXTERNAL TABLE tim_hbase_occurrence (
  id int,
  scientific_name string,
  data_resource_id int
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,v:scientific_name,v:data_resource_id",
  "hbase.columns.storage.types" = "b,b,b"
) TBLPROPERTIES(
  "hbase.table.name" = "mini_occurrences",
  "hbase.table.default.storage.type" = "binary"
);

This suggests it understands the formats:

hive> SELECT * FROM tim_hbase_occurrence LIMIT 3;
OK
1444 Abies alba 1081
1445 Abies alba 1081
1446 Abies alba 1081

But doing any queries with a WHERE clause suggests not:

hive> SELECT * FROM tim_hbase_occurrence WHERE scientific_name='Abies alba'
limit 3;
...
NULL Abies alba NULL
NULL Abies alba NULL
NULL Abies alba NULL
Time taken: 9.668 seconds

hive> SELECT * FROM tim_hbase_occurrence WHERE data_resource_id=1081;
...
0 (no records)

Can anyone provide any guidance on this please?

Thanks!
Tim


Re: Case Studies for 'Programming Hive' book from O'Reilly

2012-04-15 Thread Tim Robertson
Hi Jason,

I work for an international organization involved in the mobilization of
biodiversity data (specifically we are dealing a lot with observations of
species) so think of it as a lot of point-based information with metadata
tags.  We have built an Oozie workflow that uses Sqoop to suck in a few
databases and then does a big transformation and a set of quality-control
steps, which we did using Hive and some custom UDFs.  There is a blog introducing
this on
http://www.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/

All our work and data are open, so I can freely write about any of it, and
can link to real production code in Google svn.

If it would be of interest to you I am happy to discuss what would be most
useful to help write up for your book.  Some possible angles you might
consider:
- real UDFs in action (e.g. parsing species scientific names)
- UDTFs to generate a Google map tile cache
- Hive in an ETL workflow to remove load from DBs
- The pros and cons of calling web services from a UDF (we do it because it
keeps concerns clean, and we accept the risk of a DDoS since we control it)
- Sqoop and Hive together
- We are getting into Hive on HBase and have found UDFs can help with type
safety since we aren't running HIVE-1634
  [with the advancements in Hive 0.9 I would think our workarounds are not
worth documenting]
- Metrics illustrating the importance of join order, and knowing data
cardinality to ensure decent performance.

Hope this is of interest,
Tim





On Wed, Apr 11, 2012 at 7:48 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> Dear Hive User,
>
> We want your interesting case study for our upcoming book titled
> 'Programming Hive' from O'Reilly.
>
> How you use Hive, either high level or low level code details are both
> encouraged!
>
> Feel free to reach out with a brief abstract.
>
> Regards,
>
> Jason Rutherglen
>


A GIS contains() for Hive?

2012-03-16 Thread Tim Robertson
Hi all,

I need to perform a lot of "point in polygon" checks and want to use Hive
(currently I mix Hive, Sqoop and PostGIS in an Oozie workflow to do this).

In an ideal world, I would like to create a Hive table from a Shapefile
containing polygons, and then do the likes of the following:

  SELECT p.id, pp.id FROM points p, polygons pp WHERE pp.contains(geom,
toPoint(p.lat,p.lng))

Has anyone done anything along these lines?

Alternatively I am capable of writing a UDF that would read the shapefile
into memory and basically do a map-side join using something like a slab
decomposition technique.  It is more limited, but would meet my needs,
allowing e.g.:

  SELECT contains(p.lat,p.lng, '/data/shapefiles/countries.shp') FROM
points;
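
(For illustration, a rough sketch of that simpler UDF, assuming the polygon
is passed in as WKT and tested with JTS, so the countries.shp loading and
slab decomposition are elided:)

  import org.apache.hadoop.hive.ql.exec.UDF;
  import com.vividsolutions.jts.geom.Coordinate;
  import com.vividsolutions.jts.geom.Geometry;
  import com.vividsolutions.jts.geom.GeometryFactory;
  import com.vividsolutions.jts.io.WKTReader;

  public class ContainsUDF extends UDF {
    private final GeometryFactory factory = new GeometryFactory();
    private final WKTReader reader = new WKTReader(factory);

    public Boolean evaluate(Double lat, Double lng, String polygonWkt) {
      if (lat == null || lng == null || polygonWkt == null) {
        return null;
      }
      try {
        Geometry polygon = reader.read(polygonWkt);
        // JTS coordinates are (x, y), i.e. (lng, lat)
        return polygon.contains(factory.createPoint(new Coordinate(lng, lat)));
      } catch (Exception e) {
        return null; // unparsable WKT
      }
    }
  }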

Before I start I thought I'd ask folks as I suspect people are doing this
kind of thing on Hive by now (thinking FB and user profiling by political
boundaries etc)

I'd love to hear from anyone who's investigated this or could provide any
advice.

Thanks!
Tim


Copenhagen / Scandinavian HUG Meetup - any interest?

2012-02-13 Thread Tim Robertson
Hi all,
(cross posted to a few Hadoop mailing lists - apologies for the SPAM)

Are there any users around the Copenhagen area that would like a HUG meetup?

Just reply with +1 and I'll gauge interest.  We could probably host a
1/2 or full day if people were coming from Sweden...

We are using Hadoop, Oozie, Hive, Pig, Sqoop in production, and
getting into HBase now, and would like to chat with like minded folks.

Cheers,
Tim


Restoring configuration variables after issuing "set"

2011-03-12 Thread Tim Robertson
Hi all

Can someone please tell me how to achieve the following in a single hive script?

set original_value = mapred.reduce.tasks;
set mapred.reduce.tasks=1;
... do stuff
set mapred.reduce.tasks=original_value;

It is the first and last lines that don't work - is it possible?

Thanks,
Tim


UDFRowSequence called in Map() ?

2011-02-20 Thread Tim Robertson
Hi all,

I am using UDFRowSequence as follows:

CREATE TEMPORARY FUNCTION rowSequence AS
'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
set mapred.reduce.tasks=1;
CREATE TABLE temp_tc1_test
as
SELECT
  rowSequence() AS id,
  data_resource_id,
  local_id,
  local_parent_id,
  name,
  author
FROM normalized;

I see 2 jobs, the first of which running with 2 map() and 0 reduce()
on my small test data.  I believe the rowSequence() is being called in
the map and not the reduce as the results have duplicate IDs:

select * from temp_tc1_test where id=8915;
8915  167  1148  1113  Cytospora elaeagni   Allesch.
8915  168  7     6     Achromadora inflata  Abebe & Coomans, 1996

Is there any way to enforce the UDF is called in the reduce?

Thanks,
Tim


Opposite of explode?

2011-02-10 Thread Tim Robertson
Hi all,

Sorry if I am missing something obvious but is there an inverse of an explode?

E.g. given t1

ID Name
1  Tim
2  Tim
3  Tom
4  Frank
5  Tim

Can you create t2:

Name ID
Tim1,2,5
Tom   3
Frank 4

In Oracle it would be a
  select name,collect(id) from t1 group by name

I suspect in Hive it is related to an Array, but I can't find the syntax.

Thanks for any pointers,
Tim


Re: Are there any examples of simple UDAFs?

2010-12-07 Thread Tim Robertson
What about the count or max?
  
http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCount.java
  
http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMax.java

I've not used UDAFs, but I only got my UDTFs working by looking at the
examples in the Hive SVN itself.
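
From a quick read of those, the simple (non-generic) UDAF shape is roughly
the following (an untested sketch of a max-style function, reusing the class
names from your stack trace; note the evaluator must be a public static
inner class, and the jar holding it has to be on the classpath when you
create the temporary function):

  import org.apache.hadoop.hive.ql.exec.UDAF;
  import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
  import org.apache.hadoop.io.IntWritable;

  public class Maximum extends UDAF {

    public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {
      private IntWritable result;

      public void init() {
        result = null;
      }

      // called once per original row
      public boolean iterate(IntWritable value) {
        if (value == null) {
          return true;
        }
        if (result == null) {
          result = new IntWritable(value.get());
        } else {
          result.set(Math.max(result.get(), value.get()));
        }
        return true;
      }

      // partial result handed from the map side to the reduce side
      public IntWritable terminatePartial() {
        return result;
      }

      // combine a partial result into this one
      public boolean merge(IntWritable other) {
        return iterate(other);
      }

      // the final result
      public IntWritable terminate() {
        return result;
      }
    }
  }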

HTH,
Tim


On Wed, Dec 8, 2010 at 3:27 PM, Leo Alekseyev  wrote:
> I am trying to write a very simple aggregation function, for which the
> GenericUDAF approach described on the wiki seems like overkill.
> However, I can't get the code to run.  It always throws an exception
> of the form
>
> java.lang.ClassNotFoundException:
> com.hadoopbook.hive.Maximum$MaximumIntUDAFEvaluator
>
> (As you can see, even the sample code from Tom White's Hadoop book --
> the only sample code I found -- doesn't work.)  I'm using the standard
> add jar and create temporary function statements...
>
> --Leo
>


Re: How to generate global unique ID?

2010-11-16 Thread Tim Robertson
Does it need to be a sequential INT? If not, then a UUID works very well.
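
Something along these lines should do it (a minimal sketch; the
deterministic = false annotation matters, otherwise Hive may treat the
no-arg call as a constant and give every row the same value):

  import java.util.UUID;
  import org.apache.hadoop.hive.ql.exec.UDF;
  import org.apache.hadoop.hive.ql.udf.UDFType;
  import org.apache.hadoop.io.Text;

  @UDFType(deterministic = false)
  public class UuidUDF extends UDF {
    // returns a fresh random UUID string on each invocation
    public Text evaluate() {
      return new Text(UUID.randomUUID().toString());
    }
  }

used as e.g.:

  CREATE TEMPORARY FUNCTION uuid AS 'UuidUDF';
  SELECT uuid(), test FROM results;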

Cheers,
Tim


On Tue, Nov 16, 2010 at 8:55 AM, afancy  wrote:
> Hi, Zhang,
> How to integrate this snowflake with Hive?  Thanks!
> Regards,
> afancy
>
> On Mon, Nov 15, 2010 at 10:35 AM, Jeff Zhang  wrote:
>>
>> Please refer https://github.com/twitter/snowflake
>>
>>
>>
>> On Mon, Nov 15, 2010 at 5:09 PM, afancy  wrote:
>> > Hi,
>> > Does anybody know how to generate a unique ID in Hive? I have a table:
>> > hive> describe results;
>> > OK
>> > id int
>> > test string
>> > Time taken: 0.091 seconds
>> >
>> > When I insert data into the table results, i want to generate a unqiue
>> > Id,
>> > which is like a sequence in RDBMS.
>> > Regards,
>> > afancy
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>
>


UDTF arguments limited to 10?

2010-11-11 Thread Tim Robertson
Is there a maximum limit of 10 fields in a UDTF?

The following is always giving issues:

public void process(Object[] args) throws HiveException {
  ...
  ((LazyInteger) args[10]).getWritableObject().get();


I am trying to do:
create table density_cluster_ungrouped
as select taxonDensityUDTF(kingdom_concept_id, phylum_concept_id,
class_concept_id, order_concept_id,family_concept_id,
genus_concept_id, species_concept_id,nub_concept_id, latitude,
longitude, count, 23) as
(taxonId,tileX,tileY,zoom,clusterX,clusterY,count)
from density_occurrence_grouped;

Thanks,
Tim


Re: Only a single expression in the SELECT clause is supported with UDTF's

2010-11-09 Thread Tim Robertson
Thanks Paul,

Confirmed that it does indeed work like that.

On Tue, Nov 9, 2010 at 12:13 AM, Paul Yang  wrote:
> In your original query, I think if you put parentheses around p,k it should
> have worked:
>
> select taxonDensityUDTF(kingdom_concept_id, phylum_concept_id) as (p,k) ...
>
> -Original Message-
> From: Tim Robertson [mailto:timrobertson...@gmail.com]
> Sent: Monday, November 08, 2010 5:53 AM
> To: user@hive.apache.org
> Subject: Re: Only a single expression in the SELECT clause is supported with 
> UDTF's
>
> Thank you once again Sonal.
>
> The following worked just nicely:
>  select k,p from temp_kingdom_phylum lateral view
> taxonDensityUDTF(kingdom_concept_id, phylum_concept_id) e as k,p;
>
> [For anyone stumbling across this thread: I also had to change the
> close() and issue the forward() calls in process() and not in close(),
> which is against the example shipped with Hive, but in accord with the
> docs, which say one must not forward() from close()]
>
> Cheers,
> Tim
>
>
>
> On Mon, Nov 8, 2010 at 2:18 PM, Sonal Goyal  wrote:
>> Hi Tim,
>>
>> I guess you are running into limitations while using UDTFs. Check
>> http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#UDTF. I think you
>> should be able to use lateral view in your query.
>>
>> Thanks and Regards,
>> Sonal
>>
>> Sonal Goyal | Founder and CEO | Nube Technologies LLP
>> http://www.nubetech.co
>> http://code.google.com/p/hiho/
>>
>>
>>
>>
>>
>>
>> On Mon, Nov 8, 2010 at 2:11 PM, Tim Robertson 
>> wrote:
>>>
>>> Hi all,
>>>
>>> I am trying my first UDTF, but can't seem to get it to run.  Can
>>> anyone spot anything wrong with this please:
>>>
>>> hive> select taxonDensityUDTF(kingdom_concept_id, phylum_concept_id)
>>> as p,k from temp_kingdom_phylum;
>>> FAILED: Error in semantic analysis: Only a single expression in the
>>> SELECT clause is supported with UDTF's
>>> hive>
>>>
>>> Below is my code.  Thanks for any pointers,
>>>
>>> Tim
>>>
>>>
>>>
>>> @description(
>>>                  name = "taxonDensityUDTF",
>>>                  value = "_FUNC_(kingdom_concept_id, phylum_concept_id)"
>>>                )
>>> public class TaxonDensityUDTF extends GenericUDTF {
>>>        Integer kingdom_concept_id = Integer.valueOf(0);
>>>        Integer phylum_concept_id = Integer.valueOf(0);
>>>
>>>        /**
>>>         * @see org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#close()
>>>         */
> >>>        @Override
>>>        public void close() throws HiveException {
>>>                Object[] forwardObj = new Object[2];
>>>                forwardObj[0] = kingdom_concept_id;
>>>                forwardObj[1] = phylum_concept_id;
>>>                forward(forwardObj);
>>>                // TEST STUFF FOR NOW
>>>                forwardObj = new Object[2];
>>>                forwardObj[0] = kingdom_concept_id+1;
>>>                forwardObj[1] = phylum_concept_id+1;
>>>                forward(forwardObj);
>>>        }
>>>
>>>        /**
>>>         * @see
>>> org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#initialize(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector[])
>>>         */
> >>>        @Override
>>>        public StructObjectInspector initialize(ObjectInspector[] arg0)
>>> throws UDFArgumentException {
> >>>                ArrayList<String> fieldNames = new ArrayList<String>();
> >>>                ArrayList<ObjectInspector> fieldOIs = new
> >>> ArrayList<ObjectInspector>();
>>>                fieldNames.add("kingdom_concept_id");
>>>                fieldNames.add("phylum_concept_id");
>>>
>>>  fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
>>>
>>>  fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
>>>                return
>>> ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs);
>>>        }
>>>
>>>        /**
>>>         * @see
>>> org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#process(java.lang.Object[])
>>>         */
> >>>        @Override
>>>        public void process(Object[] args) throws HiveException {
>>>                kingdom_concept_id = (Integer) args[0];
>>>                phylum_concept_id = (Integer) args[1];
>>>        }
>>> }
>>
>>
>


Re: Unions causing many scans of input - workaround?

2010-11-08 Thread Tim Robertson
For anyone stumbling across this thread, the (pretty much) working UDTF
can be seen at:
  
http://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-spatial/src/main/java/org/gbif/occurrence/spatial/udf/TaxonDensityUDTF.java

allowing for the following query:

create table taxon_density as
select taxonId,tileX,tileY,zoom,clusterX,clusterY,sum(count) from
occurrence_record lateral view taxonDensityUDTF(kingdom_concept_id,
phylum_concept_id, class_concept_id, order_concept_id,
family_concept_id, genus_concept_id,
species_concept_id,nub_concept_id, latitude, longitude, 10) e
as taxonId,tileX,tileY,zoom,clusterX,clusterY,count
group by taxonId,tileX,tileY,zoom,clusterX,clusterY;

Thanks again for the pointers Sonal and Namit, and also on the other thread,

Tim




On Mon, Nov 8, 2010 at 9:17 AM, Tim Robertson  wrote:
> I am writing a GenericUDTF now, but notice on
> http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF
>
> the method docs show:
> /**
>   * Called to notify the UDTF that there are no more rows to process. Note 
> that
>   * forward() should not be called in this function. Only clean up code should
>   * be run.
>   */
>  public abstract void close() throws HiveException;
>
> but the example does exactly that:
> @Override
>  public void close() throws HiveException {
>    forwardObj[0] = count;
>    forward(forwardObj);
>    forward(forwardObj);
>  }
>
> I'll assume the example is correct and continue, but it might be worth
> fixing that page.
>
> Cheers,
> Tim
>
>
>
>
>
> On Mon, Nov 8, 2010 at 7:35 AM, Tim Robertson  
> wrote:
>> Thank you both,
>>
>> A quick glance looks like that is what I am looking for.  When I get
>> it working, I'll post the solution.
>>
>> Cheers,
>> Tim
>>
>> On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain  wrote:
>>> Other option would be to create a wrapper script (not use either UDF or
>>> UDTF)
>>> That script, in any language, can emit any number of output rows per input
>>> row.
>>>
>>> Look at:
>>> http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
>>> for details
>>>
>>> 
>>> From: Sonal Goyal [sonalgoy...@gmail.com]
>>> Sent: Sunday, November 07, 2010 8:40 PM
>>> To: user@hive.apache.org
>>> Subject: Re: Unions causing many scans of input - workaround?
>>>
>>> Hey Tim,
>>>
>>> You have an interesting problem. Have you tried creating a UDTF for your
>>> case, so that you can possibly emit more than one record for each row of
>>> your input?
>>>
>>> http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF
>>>
>>> Thanks and Regards,
>>> Sonal
>>>
>>> Sonal Goyal | Founder and CEO | Nube Technologies LLP
>>> http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson 
>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I am porting custom MR code to Hive and have written working UDFs
>>>> where I need them.  Is there a work around to having to do this in
>>>> Hive:
>>>>
>>>> select * from
>>>> (
>>>>    select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
>>>> y, 0 as zoom, funct2(longitude, 0) as f2_x, funct2(latitude,0) as
>>>> f2_y, count (1) as count
>>>>    from table
>>>>    group by name_id, x, y, f2_x, f2_y
>>>>
>>>>    UNION ALL
>>>>
>>>>    select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
>>>> y, 1 as zoom, funct2(longitude, 1) as f2_x, funct2(latitude,1) as
>>>> f2_y, count (1) as count
>>>>    from table
>>>>    group by name_id, x, y, f2_x, f2_y
>>>>
>>>>   --- etc etc increasing in zoom
>>>> )
>>>>
>>>> The issue being that this does many passes over the table, whereas
>>>> previously in my Map() I would just emit many times from the same
>>>> input record and then let it all group in the shuffle and sort.
>>>> I actually emit 184 times for an input record (23 zoom levels of
>>>> google maps, and 8 ways to derive the name_id) for a single record
>>>> which means 184 union statements - Is it possible in hive to force it
>>>> to emit many times from the source record in the stage-1 map?
>>>>
>>>> (ahem) Does anyone know if Pig can do this if not in Hive?
>>>>
>>>> I hope I have explained this well enough to make sense.
>>>>
>>>> Thanks in advance,
>>>> Tim
>>>
>>>
>>
>


Re: Only a single expression in the SELECT clause is supported with UDTF's

2010-11-08 Thread Tim Robertson
Thank you once again Sonal.

The following worked just nicely:
  select k,p from temp_kingdom_phylum lateral view
taxonDensityUDTF(kingdom_concept_id, phylum_concept_id) e as k,p;

[For anyone stumbling across this thread: I also had to change the
close() and issue the forward() calls in process() and not in close(),
which is against the example shipped with Hive, but in accord with the
docs, which say one must not forward() from close()]
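
In other words the working shape is roughly this (a sketch only; the
initialize() stays exactly as in the code quoted below):

  public class TaxonDensityUDTF extends GenericUDTF {

    @Override
    public void process(Object[] args) throws HiveException {
      // emit from process(); forward() may be called any number of
      // times per input row
      forward(new Object[] { args[0], args[1] });
    }

    @Override
    public void close() throws HiveException {
      // clean-up only, no forward() calls
    }
  }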

Cheers,
Tim



On Mon, Nov 8, 2010 at 2:18 PM, Sonal Goyal  wrote:
> Hi Tim,
>
> I guess you are running into limitations while using UDTFs. Check
> http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#UDTF. I think you
> should be able to use lateral view in your query.
>
> Thanks and Regards,
> Sonal
>
> Sonal Goyal | Founder and CEO | Nube Technologies LLP
> http://www.nubetech.co
> http://code.google.com/p/hiho/
>
>
>
>
>
>
> On Mon, Nov 8, 2010 at 2:11 PM, Tim Robertson 
> wrote:
>>
>> Hi all,
>>
>> I am trying my first UDTF, but can't seem to get it to run.  Can
>> anyone spot anything wrong with this please:
>>
>> hive> select taxonDensityUDTF(kingdom_concept_id, phylum_concept_id)
>> as p,k from temp_kingdom_phylum;
>> FAILED: Error in semantic analysis: Only a single expression in the
>> SELECT clause is supported with UDTF's
>> hive>
>>
>> Below is my code.  Thanks for any pointers,
>>
>> Tim
>>
>>
>>
>> @description(
>>                  name = "taxonDensityUDTF",
>>                  value = "_FUNC_(kingdom_concept_id, phylum_concept_id)"
>>                )
>> public class TaxonDensityUDTF extends GenericUDTF {
>>        Integer kingdom_concept_id = Integer.valueOf(0);
>>        Integer phylum_concept_id = Integer.valueOf(0);
>>
>>        /**
>>         * @see org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#close()
>>         */
>>        @Override
>>        public void close() throws HiveException {
>>                Object[] forwardObj = new Object[2];
>>                forwardObj[0] = kingdom_concept_id;
>>                forwardObj[1] = phylum_concept_id;
>>                forward(forwardObj);
>>                // TEST STUFF FOR NOW
>>                forwardObj = new Object[2];
>>                forwardObj[0] = kingdom_concept_id+1;
>>                forwardObj[1] = phylum_concept_id+1;
>>                forward(forwardObj);
>>        }
>>
>>        /**
>>         * @see
>> org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#initialize(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector[])
>>         */
>>        @Override
>>        public StructObjectInspector initialize(ObjectInspector[] arg0)
>> throws UDFArgumentException {
>>                ArrayList<String> fieldNames = new ArrayList<String>();
>>                ArrayList<ObjectInspector> fieldOIs = new
>> ArrayList<ObjectInspector>();
>>                fieldNames.add("kingdom_concept_id");
>>                fieldNames.add("phylum_concept_id");
>>
>>  fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
>>
>>  fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
>>                return
>> ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs);
>>        }
>>
>>        /**
>>         * @see
>> org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#process(java.lang.Object[])
>>         */
>>        @Override
>>        public void process(Object[] args) throws HiveException {
>>                kingdom_concept_id = (Integer) args[0];
>>                phylum_concept_id = (Integer) args[1];
>>        }
>> }
>
>


Only a single expression in the SELECT clause is supported with UDTF's

2010-11-08 Thread Tim Robertson
Hi all,

I am trying my first UDTF, but can't seem to get it to run.  Can
anyone spot anything wrong with this please:

hive> select taxonDensityUDTF(kingdom_concept_id, phylum_concept_id)
as p,k from temp_kingdom_phylum;
FAILED: Error in semantic analysis: Only a single expression in the
SELECT clause is supported with UDTF's
hive>

Below is my code.  Thanks for any pointers,

Tim



@description(
  name = "taxonDensityUDTF",
  value = "_FUNC_(kingdom_concept_id, phylum_concept_id)"
)
public class TaxonDensityUDTF extends GenericUDTF {
  Integer kingdom_concept_id = Integer.valueOf(0);
  Integer phylum_concept_id = Integer.valueOf(0);

  /**
   * @see org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#close()
   */
  @Override
  public void close() throws HiveException {
    Object[] forwardObj = new Object[2];
    forwardObj[0] = kingdom_concept_id;
    forwardObj[1] = phylum_concept_id;
    forward(forwardObj);
    // TEST STUFF FOR NOW
    forwardObj = new Object[2];
    forwardObj[0] = kingdom_concept_id + 1;
    forwardObj[1] = phylum_concept_id + 1;
    forward(forwardObj);
  }

  /**
   * @see org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#initialize(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector[])
   */
  @Override
  public StructObjectInspector initialize(ObjectInspector[] arg0)
      throws UDFArgumentException {
    ArrayList<String> fieldNames = new ArrayList<String>();
    ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
    fieldNames.add("kingdom_concept_id");
    fieldNames.add("phylum_concept_id");
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
  }

  /**
   * @see org.apache.hadoop.hive.ql.udf.generic.GenericUDTF#process(java.lang.Object[])
   */
  @Override
  public void process(Object[] args) throws HiveException {
    kingdom_concept_id = (Integer) args[0];
    phylum_concept_id = (Integer) args[1];
  }
}


Re: Unions causing many scans of input - workaround?

2010-11-08 Thread Tim Robertson
I am writing a GenericUDTF now, but notice on
http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF

the method docs show:
/**
   * Called to notify the UDTF that there are no more rows to process. Note that
   * forward() should not be called in this function. Only clean up code should
   * be run.
   */
  public abstract void close() throws HiveException;

but the example does exactly that:
@Override
  public void close() throws HiveException {
forwardObj[0] = count;
forward(forwardObj);
forward(forwardObj);
  }

I'll assume the example is correct and continue, but it might be worth
fixing that page.

Cheers,
Tim





On Mon, Nov 8, 2010 at 7:35 AM, Tim Robertson  wrote:
> Thank you both,
>
> A quick glance looks like that is what I am looking for.  When I get
> it working, I'll post the solution.
>
> Cheers,
> Tim
>
> On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain  wrote:
>> Other option would be to create a wrapper script (not use either UDF or
>> UDTF)
>> That script, in any language, can emit any number of output rows per input
>> row.
>>
>> Look at:
>> http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
>> for details
>>
>> 
>> From: Sonal Goyal [sonalgoy...@gmail.com]
>> Sent: Sunday, November 07, 2010 8:40 PM
>> To: user@hive.apache.org
>> Subject: Re: Unions causing many scans of input - workaround?
>>
>> Hey Tim,
>>
>> You have an interesting problem. Have you tried creating a UDTF for your
>> case, so that you can possibly emit more than one record for each row of
>> your input?
>>
>> http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF
>>
>> Thanks and Regards,
>> Sonal
>>
>> Sonal Goyal | Founder and CEO | Nube Technologies LLP
>> http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal
>>
>>
>>
>>
>>
>> On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson 
>> wrote:
>>>
>>> Hi all,
>>>
>>> I am porting custom MR code to Hive and have written working UDFs
>>> where I need them.  Is there a work around to having to do this in
>>> Hive:
>>>
>>> select * from
>>> (
>>>    select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
>>> y, 0 as zoom, funct2(longitude, 0) as f2_x, funct2(latitude,0) as
>>> f2_y, count (1) as count
>>>    from table
>>>    group by name_id, x, y, f2_x, f2_y
>>>
>>>    UNION ALL
>>>
>>>    select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
>>> y, 1 as zoom, funct2(longitude, 1) as f2_x, funct2(latitude,1) as
>>> f2_y, count (1) as count
>>>    from table
>>>    group by name_id, x, y, f2_x, f2_y
>>>
>>>   --- etc etc increasing in zoom
>>> )
>>>
>>> The issue being that this does many passes over the table, whereas
>>> previously in my Map() I would just emit many times from the same
>>> input record and then let it all group in the shuffle and sort.
>>> I actually emit 184 times for an input record (23 zoom levels of
>>> google maps, and 8 ways to derive the name_id) for a single record
>>> which means 184 union statements - Is it possible in hive to force it
>>> to emit many times from the source record in the stage-1 map?
>>>
>>> (ahem) Does anyone know if Pig can do this if not in Hive?
>>>
>>> I hope I have explained this well enough to make sense.
>>>
>>> Thanks in advance,
>>> Tim
>>
>>
>


Re: Unions causing many scans of input - workaround?

2010-11-07 Thread Tim Robertson
Thank you both,

A quick glance looks like that is what I am looking for.  When I get
it working, I'll post the solution.

Cheers,
Tim

On Mon, Nov 8, 2010 at 6:55 AM, Namit Jain  wrote:
> Other option would be to create a wrapper script (not use either UDF or
> UDTF)
> That script, in any language, can emit any number of output rows per input
> row.
>
> Look at:
> http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
> for details
>
> 
> From: Sonal Goyal [sonalgoy...@gmail.com]
> Sent: Sunday, November 07, 2010 8:40 PM
> To: user@hive.apache.org
> Subject: Re: Unions causing many scans of input - workaround?
>
> Hey Tim,
>
> You have an interesting problem. Have you tried creating a UDTF for your
> case, so that you can possibly emit more than one record for each row of
> your input?
>
> http://wiki.apache.org/hadoop/Hive/DeveloperGuide/UDTF
>
> Thanks and Regards,
> Sonal
>
> Sonal Goyal | Founder and CEO | Nube Technologies LLP
> http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal
>
>
>
>
>
> On Mon, Nov 8, 2010 at 2:31 AM, Tim Robertson 
> wrote:
>>
>> Hi all,
>>
>> I am porting custom MR code to Hive and have written working UDFs
>> where I need them.  Is there a work around to having to do this in
>> Hive:
>>
>> select * from
>> (
>>    select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
>> y, 0 as zoom, funct2(longitude, 0) as f2_x, funct2(latitude,0) as
>> f2_y, count (1) as count
>>    from table
>>    group by name_id, x, y, f2_x, f2_y
>>
>>    UNION ALL
>>
>>    select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
>> y, 1 as zoom, funct2(longitude, 1) as f2_x, funct2(latitude,1) as
>> f2_y, count (1) as count
>>    from table
>>    group by name_id, x, y, f2_x, f2_y
>>
>>   --- etc etc increasing in zoom
>> )
>>
>> The issue being that this does many passes over the table, whereas
>> previously in my Map() I would just emit many times from the same
>> input record and then let it all group in the shuffle and sort.
>> I actually emit 184 times for an input record (23 zoom levels of
>> google maps, and 8 ways to derive the name_id) for a single record
>> which means 184 union statements - Is it possible in hive to force it
>> to emit many times from the source record in the stage-1 map?
>>
>> (ahem) Does anyone know if Pig can do this if not in Hive?
>>
>> I hope I have explained this well enough to make sense.
>>
>> Thanks in advance,
>> Tim
>
>


Unions causing many scans of input - workaround?

2010-11-07 Thread Tim Robertson
Hi all,

I am porting custom MR code to Hive and have written working UDFs
where I need them.  Is there a work around to having to do this in
Hive:

select * from
(
select name_id, toTileX(longitude,0) as x, toTileY(latitude,0) as
y, 0 as zoom, funct2(longitude, 0) as f2_x, funct2(latitude,0) as
f2_y, count (1) as count
from table
group by name_id, x, y, f2_x, f2_y

UNION ALL

select name_id, toTileX(longitude,1) as x, toTileY(latitude,1) as
y, 1 as zoom, funct2(longitude, 1) as f2_x, funct2(latitude,1) as
f2_y, count (1) as count
from table
group by name_id, x, y, f2_x, f2_y

   --- etc etc increasing in zoom
)

The issue being that this does many passes over the table, whereas
previously in my Map() I would just emit many times from the same
input record and then let it all group in the shuffle and sort.
I actually emit 184 times for an input record (23 zoom levels of
google maps, and 8 ways to derive the name_id) for a single record
which means 184 union statements - Is it possible in hive to force it
to emit many times from the source record in the stage-1 map?

(ahem) Does anyone know if Pig can do this if not in Hive?

I hope I have explained this well enough to make sense.

Thanks in advance,
Tim


Re: NOT IN query

2010-11-03 Thread Tim Robertson
Please try this in Hive:
select distinct a.id from tableA a LEFT OUTER join tableB b on
a.id=b.id where b.id is null

Cheers,
Tim


On Wed, Nov 3, 2010 at 1:19 PM, Tim Robertson  wrote:
> In SQL you use a left join:
>
> # so in mysql:
> select distinct a.id from tableA a left join tableB b on a.id=b.id
> where b.id is null
>
> Not sure exactly how that ports to Hive, but it should be something
> along those lines.
>
> HTH,
> Tim
>
>
> On Wed, Nov 3, 2010 at 1:13 PM, איל (Eyal)  wrote:
>> Hi,
>>
>> I have a table A with some values and another table B with some other values
>>
>> How do I get all the distinct values from A that are NOT in B
>>
>> e.g
>>
>> if table A has values 1,2,3,4,1,2,3,5,6,7  and B has values 2,3,4,5,6
>>
>> then the result should be 1,7
>>
>> Thanks
>>   Eyal
>>
>


Re: NOT IN query

2010-11-03 Thread Tim Robertson
In SQL you use a left join:

# so in mysql:
select distinct a.id from tableA a left join tableB b on a.id=b.id
where b.id is null

Not sure exactly how that ports to Hive, but it should be something
along those lines.

HTH,
Tim


On Wed, Nov 3, 2010 at 1:13 PM, איל (Eyal)  wrote:
> Hi,
>
> I have a table A with some values and another table B with some other values
>
> How do I get all the distinct values from A that are NOT in B
>
> e.g
>
> if table A has values 1,2,3,4,1,2,3,5,6,7  and B has values 2,3,4,5,6
>
> then the result should be 1,7
>
> Thanks
>   Eyal
>


Re: UDF help

2010-11-02 Thread Tim Robertson
Thanks Edward.  I'll poke around there.
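
For anyone finding this thread later, the array-returning plumbing via
GenericUDF looks roughly like this (an untested sketch; argument checking
and the actual tile maths are elided):

  import java.util.ArrayList;
  import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
  import org.apache.hadoop.hive.ql.metadata.HiveException;
  import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
  import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
  import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

  public class GoogleTileCoordsGenericUDF extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] args)
        throws UDFArgumentException {
      // declare the return type as array<int>
      return ObjectInspectorFactory.getStandardListObjectInspector(
          PrimitiveObjectInspectorFactory.javaIntObjectInspector);
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
      ArrayList<Integer> xy = new ArrayList<Integer>();
      xy.add(0); // tile x would be computed from args here
      xy.add(0); // tile y would be computed from args here
      return xy;
    }

    @Override
    public String getDisplayString(String[] children) {
      return "toGoogleCoords(...)";
    }
  }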



On Tue, Nov 2, 2010 at 6:40 PM, Edward Capriolo  wrote:
> On Tue, Nov 2, 2010 at 12:47 PM, Tim Robertson
>  wrote:
>> Hi all,
>>
>> Is the following a valid UDF please?
>>
>> When I run it I get the following so I presume not:
>> hive> select toGoogleCoords(latitude,longitude,1) from
>> raw_occurrence_record limit 100;
>> FAILED: Error in semantic analysis:
>> java.lang.IllegalArgumentException: Error: name expected at the
>> position 7 of 'struct<>' but '>' is found.
>>
>> Is it possible to return an Array from a UDF?
>>
>> Thanks for any pointers,
>> Tim
>>
>>
>>
>>
>> public class GoogleTileCoordsUDF extends UDF {
>>
>>  public IntWritable[] evaluate(Text latitude, Text longitude,
>> IntWritable zoomLevel) {
>>    if (latitude == null
>>                || longitude == null
>>                || zoomLevel == null) {
>>      return null;
>>    }
>>
>>    double lat = Double.parseDouble(latitude.toString());
>>    double lng = Double.parseDouble(longitude.toString());
>>
>>    Point p = GoogleTileUtil.toTileXY(lat, lng, zoomLevel.get());
>>
>>    if (p==null) {
>>      return null;
>>    }
>>
>>    IntWritable[] xy = new IntWritable[2];
>>    xy[0]=new IntWritable(p.x);
>>    xy[1]=new IntWritable(p.y);
>>    return xy;
>>  }
>> }
>>
>
> I believe a UDF can not return an Array type. You may have to use the
> GenericUDF interface for that.
>


UDF help

2010-11-02 Thread Tim Robertson
Hi all,

Is the following a valid UDF please?

When I run it I get the following so I presume not:
hive> select toGoogleCoords(latitude,longitude,1) from
raw_occurrence_record limit 100;
FAILED: Error in semantic analysis:
java.lang.IllegalArgumentException: Error: name expected at the
position 7 of 'struct<>' but '>' is found.

Is it possible to return an Array from a UDF?

Thanks for any pointers,
Tim




public class GoogleTileCoordsUDF extends UDF {

  public IntWritable[] evaluate(Text latitude, Text longitude,
      IntWritable zoomLevel) {
    if (latitude == null
        || longitude == null
        || zoomLevel == null) {
      return null;
    }

    double lat = Double.parseDouble(latitude.toString());
    double lng = Double.parseDouble(longitude.toString());

    Point p = GoogleTileUtil.toTileXY(lat, lng, zoomLevel.get());

    if (p == null) {
      return null;
    }

    IntWritable[] xy = new IntWritable[2];
    xy[0] = new IntWritable(p.x);
    xy[1] = new IntWritable(p.y);
    return xy;
  }
}


Re: HBase as input AND output?

2010-10-13 Thread Tim Robertson
That's right.  Hive can use an HBase table as the input to a Hive
query regardless of the output format, and can also write the output
to an HBase table regardless of the input format.  You can also
supposedly do a join in Hive that takes one side of the join from an
HBase table and the other side from a text file, which is very powerful.
I haven't done it myself, but intend to try it shortly.

HTH,
Tim


On Wed, Oct 13, 2010 at 10:07 PM, Otis Gospodnetic
 wrote:
> Hi,
>
> I was wondering how I can query data stored in HBase and remembered Hive's 
> HBase
> integration:
> http://wiki.apache.org/hadoop/Hive/HBaseIntegration
>
> After watching John Sichi's video
> (http://developer.yahoo.com/blogs/hadoop/posts/2010/04/hundreds_of_hadoop_fans_at_the/
>  ) I have a better idea about what functionality this integration provides, 
> but
> I still have some questions.
>
> Would it be correct to say that Hive-HBase integration makes the following 
> data
> flow possible:
>
> 0) Hive or Files => Custom HQL statement that aggregates data  ==> HBase
> 1) HBase ==> Custom HQL statement that aggregates data  ==> HBase
> 2) HBase ==> Custom HQL statement that aggregates data  ==> output (console?)
>
> Of the above, 1) is what I'm wondering the most about right now.
>
> In other words, it seems to me that Hive may be able to look at *just* data
> stored in HBase *without* the typical data/files in HDFS that Hive normally 
> runs
> its MR jobs against.
>
> Is this correct?
>
> Thanks,
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>